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Abstract 


In many engineering domains, systems are composed of partially independent 
subsystems—power systems are composed of distribution and transmission 
systems, teams of robots are composed of individual robots, and chemical 
process systems are composed of vessels, heat exchangers and reactors. Often, 
these subsystems should reach a common goal such as satisfying a power 
demand with minimum cost, flying in a formation, or reaching an optimal 
set-point. At the same time, limited information exchange is desirable—for 
confidentiality reasons but also due to communication constraints. Moreover, a 
fast and reliable decision process is key as applications might be safety-critical. 


Mathematical optimization techniques are among the most successful tools for 
controlling systems optimally with feasibility guarantees. Yet, they are often 
centralized—all data has to be collected in one central and computationally 
powerful entity. Methods from distributed optimization control the subsystems 
in a distributed or decentralized fashion, reducing or avoiding central coordina- 
tion. These methods have a long and successful history. Classical distributed 
optimization algorithms, however, are typically designed for convex problems. 
Hence, they are only partially applicable in the above domains since many of 
them lead to optimization problems with non-convex constraints. 


This thesis develops one of the first frameworks for distributed and decen- 
tralized optimization with non-convex constraints. Based on the Augmented 
Lagrangian Alternating Direction Inexact Newton (ALADIN) algorithm, a bi- 
level distributed ALADIN framework is presented, solving the coordination 
step of ALADIN in a decentralized fashion. This framework can handle various 
decentralized inner algorithms, two of which we develop here: a decentralized 
variant of the Alternating Direction Method of Multipliers (ADMM) and a 
novel decentralized Conjugate Gradient algorithm. Decentralized conjugate 
gradient is to the best of our knowledge the first decentralized algorithm with 
a guarantee of convergence to the exact solution in a finite number of iter- 
ates. Sufficient conditions for fast local convergence of bi-level ALADIN are 
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derived. Bi-level ALADIN strongly reduces the communication and coordi- 
nation effort of ALADIN and preserves its fast convergence guarantees. We 
illustrate these properties on challenging problems from power systems and 
control, and compare performance to the widely used ADMM. 


The developed methods are implemented in the open-source MATLAB toolbox 
ALADIN-a—one of the first toolboxes for decentralized non-convex optimiza- 
tion. ALADIN-a comes with a rich set of application examples from different 
domains showing its broad applicability. As an additional contribution, this 
thesis provides new insights why state-of-the-art distributed algorithms might 
encounter issues for constrained problems. 


Deutsche Kurzfassung 


In vielen Anwendungen bestehen Systeme aus einer Vielzahl von Subsyste- 
men. Elektrische Energiesysteme bestehen aus Transportsystemen und Verteil- 
systemen, Roboter Teams bestehen aus einzelnen Robotern und chemische 
Prozesssysteme bestehen aus einzelnen Kesseln, Wärmetauschern und Reak- 
toren. Oftmals arbeiten diese Subsysteme auf ein gemeinsames Ziel hin, wie 
zum Beispiel die móglichst kostengünstige Deckung eines Energiebedarfs, das 
Fliegen in einer bestimmten Formation oder die kostenoptimale Ansteuerung 
eines Arbeitspunktes. Dabei ist ein móglichst geringer Informationsaustausch 
wiinschenswert—sei es aus Vertraulichkeitsgründen oder auch aus Gründen 
limitierter Datenübertragungskapazität. Zusätzlich ist ein schneller und zuver- 
lassiger Entscheidungsprozess essentiell—besonders im Falle sicherheitskri- 
tischer Anwendungen. 


Methoden der mathematischen Optimierung gehóren zu den erfolgreichsten 
Werkzeugen für die optimale Steuerung von Systemen mit Garantien. Noch 
sind diese Verfahren oftmals zentralisiert—alle Daten werden in einer zen- 
tralen, koordinierenden Einheit mit hoher Rechenleistung gesammelt. Meth- 
oden der verteilten Optimierung steuern Subsysteme auf verteilte oder dezen- 
trale Weise unter Reduktion oder Vermeidung zentraler Koordination. Diese 
Methoden haben eine lange und erfolgreiche Historie. Klassische verteilte 
Optimierungsverfahren wurden in den meisten Fallen für konvexe Probleme 
entwickelt. Damit sind sie jedoch nur teilweise auf Probleme in den oben er- 
wähnten Domänen anwendbar, da viele von ihnen auf Optimierungsprobleme 
mit nichtkonvexen Nebenbedingungen führen. 


Die vorliegende Arbeit entwickelt eine der ersten Verfahrensklassen für die 
verteilte und dezentrale Optimierung unter nichtkonvexen Nebenbedingun- 
gen. Basierend auf dem Augmented Lagrangian Alternating Direction Inexact 
Newton (ALADIN) Algorithmus wird ein zweistufiges ALADIN Verfahren 
präsentiert, welches den Koordinationsschritt ALADIN’s dezentral löst. Diese 
Verfahrensklasse ist in der Lage mit verschiedenen inneren dezentrale Ver- 
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fahren umzugehen, wobei zwei solcher Verfahren vorgestellt werden: eine 
dezentrale Variante des Alternating Direction of Multipliers Method (ADMM) 
Algorithmus und eine dezentrale Variante des konjugierte Gradienten Ver- 
fahrens. Das dezentrale konjugierte Gradienten Verfahren ist nach unserem 
Kenntnisstand das erste dezentrale Verfahren mit einer Konvergenzgarantie zur 
exakten Lósung eines Koordinationsproblems in einer endlichen Anzahl von 
Schritten. Hinreichende Bedingungen für die schnelle lokale Konvergenz des 
zweistufigen ALADIN Verfahrens werden hergeleitet. Zusätzlich zu seinen 
schnellen Konvergenzeigenschaften reduziert das zweistufige Verfahren den 
Kommunikations- und Koordinationsbedarf ALADIN's. Wir verdeutlichen 
diese Eigenschaften anhand herausfordernder Anwendungsbeispiele, die in 
elektrischen Energiesystemen sowie in der Optimalsteuerung auftreten, und 
vergleichen die Leistungsfahigkeit zu dem weitverbreiteten ADMM Verfahren. 


Die entwickelten Verfahren sind in der quelloffenen MATLAB Toolbox 
ALADIN-« implementiert—eine der ersten Toolboxen für die dezentrale 
nichtkonvexe Optimierung. ALADIN-a verfügt über vielfältige Anwendungs- 
beispiele kommend aus verschiedenen Domänen, welches ihre breite Anwend- 
barkeit unterstreicht. Als zusätzlichen Beitrag dokumentiert diese Arbeit neue 
Einsichten über die Gründe, warum aktuelle verteilte Optimierungsverfahren 
in manchen Fallen Schwierigkeiten beim Lósen von Optimierungsprobleme 
unter Nebenbedingungen haben. 
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Frequently Used Symbols 


Distributed optimization 


XZ primal variables 
A,y,u dual variables 
X feasible set 


R set of subsystems 
A set of active constraints 

S, (Su) set of (strongly) convex functions 
Ux indicator function to the set X 


p penalty parameter 
x weighting matrix 

) outer iteration index 
yr inner iteration index 
) optimal value 


Power systems 


y admittance of a branch 
Y bus admittance matrix 
G,B real/imaginary part of Y 
9. v voltage angle/magnitude 
s,p,q apparent/active/reactive power 
N set uf buses (nodes) 
UN; partition of N 
e set of branches (edges) 
(Jk quantities related bus k 
(Jii quantities related to the branch from bus k to bus / 


(8, ()5, (5 — generation/demand/storage 
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1 Introduction 


In many engineering domains, one can observe a trend towards systems com- 
posed of interconnected subsystems coordinating towards a common goal. 
Examples range from power systems, which collaborate to satisfy power de- 
mands, via robotics, collaborating to fulfill a certain task, to chemical engi- 
neering, where reactors collaborate to produce a certain product in an optimal 
fashion. Because of confidentiality reasons and communication constraints, 
this collaboration should be achieved subject to limited information exchange 
and limited central coordination while being fast and reliable. 


One particular example for such systems are electricity grids. Electricity 
grids are typically composed of smaller grids, each of which is operated 
by one system operator. All together, they aim at a cheap and safe energy 
supply. To achieve this goal, it is usually necessary to exchange grid models 
and consumption data. Exchanging this data is, however, often problematic 
because of confidentiality reasons. Moreover, avoiding a single point of failure 
(e.g. a central coordinator) is also important for security reasons. 


This thesis is about operating such systems by means of mathematical opti- 
mization techniques. 


Employing mathematical optimization 


Mathematical optimization is one of the most successful techniques for con- 
trolling systems automatically and optimally towards a goal. Mathematical 
optimization problems are typically formulated as 


min f(x) subjectto x € X, (1.1) 


where f : X — R is called the objective function, X is called the constraint 
set or feasible set and x is called the decision vector. The objective function 
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min f 
TEX 


(a) Centralized optimization. 


Ar =b inf 
5 mun 
min fı 1 à 
= min 
TEX min f3 zı E X 
£3 € Az , T3 E Az 
min f2 min f2 
TE Aa TE Xo 
(b) Distributed optimization. (c) Decentralized optimization. 


Figure 1.1: Distributed optimization, decentralized and centralized optimization. 


f encodes the goal (such as minimizing the cost of power generation in an 
electricity grid) and the constraint set X captures the underlying physical 
model (such as an electricity grid model) and engineering limitations (such as 
maximum generator outputs). Problems in form of (1.1) are typically solved by 
centralized optimization algorithms in the sense that the data of all subsystems 
is collected in one central entity. This entity solves the problem and broadcasts 
the solution to the subsystems (Figure 1.12). 


Centralized optimization is often not desirable because of information aggre- 
gation in a single entity, which implies the existence of a single point of failure. 
Hence, techniques from distributed optimization are important. Distributed 
optimization means to shift computation mainly to the subsystems with a co- 
ordinating entity (Figure 1.1b).! Distributed optimization algorithms typically 
require problems, where f and X have a special structure. One common 
structure is 


min ;(xj  subjectto x; EX; IER and Aj;x; =b. (12) 
ET ) subj 
D IER iER 


! Note that, although the terminology of “distributed algorithms" might be used slightly different 
in computer science, there is an intimate connection between these two fields. We briefly 
comment on this interconnection in Appendix D. 
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Here, the optimization problem is distributed among a set of subsystems 
R = {1,..., R}, where each subsystem has its own objective function f; (for 
example encoding the generator cost of one system operator), its own decision 
vector x; (encoding voltages and powers in the region controlled by the system 
operator) and its own constraint set A; (containing the grid model and engineer- 
ing constraints of one region). Coupling between the subsystems is considered 
via a so-called consensus constraint >);<@ Aix; = b. This consensus constraint 
encodes the need for compatible physical values at the interconnection points 
between two neighboring subsystems (such as a matching power import/export 
between neighboring system operators). 


While distributed optimization reduces the amount of central coordination, 
it still requires a coordinator. The appealing promise of decentralized opti- 
mization algorithms is to overcome the need for this coordinator, i.e. they 
avoid central coordination entirely and exchange information directly between 
neighbored subsystems (Figure 1.1c). 


Although distributed and especially decentralized optimization algorithms are 
promising candidates for controlling systems with limited information ex- 
change, classical distributed and decentralized algorithms are typically de- 
signed for convex problems. In many practical applications, however, models 
are non-linear leading to non-convex constraint sets. Hence, classical dis- 
tributed and decentralized algorithms are often not applicable—at least not 
with convergence guarantees. Moreover, classical algorithms are often slow in 
practice. 


In view of the above, the main research question of this thesis is as follows: 


How to design efficient decentralized optimization algorithms 
for problems with non-convex constraints under limited 
information exchange and guaranteeing fast convergence? 


Outline and contributions 


Next, we outline the content of this thesis and highlight contributions of each 
chapter. 
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Chapter 2—Basics of Distributed Optimization 


Chapter 2 briefly recalls basics of distributed optimization. We start with 
the fundamentals of nonlinear programming such as optimality conditions 
and prominent nonlinear programming algorithms. We continue by recalling 
two popular classical distributed optimization algorithms serving as building 
blocks for the algorithms we derive in Chapter 4 and serving as benchmark 
algorithms for numerical tests in Chapter 5. We recall the basic form of 
the Augmented Lagrangian Alternating Direction Inexact Newton (ALADIN) 
algorithm, which builds the foundation of the bi-level ALADIN algorithm, 
which we will develop in Chapter 4. Chapter 2 concludes with new insights 
why classical distributed algorithms might exhibit severe issues for constrained 
problems. We show that ALADIN is able to overcome these limitations by 
its more advanced coordination step, which explicitly considers constraint and 
curvature information. 


Chapter 3—A survey on Distributed Optimization 


In Chapter 3, we provide a literature review on distributed and decentralized 
optimization algorithms from different fields. The review is new in this form, 
since most literature reviews consider distributed algorithms from one partic- 
ular community only. We conclude Chapter 3 by showing that most of the 
approaches so far have difficulties especially for problems with non-convex 
constraints. 


Chapter 4—Bi-level Distributed ALADIN 


Chapter 4 presents the main conceptual contribution of this thesis: a bi-level 
distributed ALADIN framework combining basic ALADIN with condensing 
techniques from nonlinear programming and an inner decentralization layer.” 
By using condensing techniques, we significantly lower coordination and com- 


? For the sake of readability, we will simply write bi-level ALADIN in the following. Note 
that we do not make any connection to bi-level optimization problems in the sense of nested 
optimization problems [CMS07]. 
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munication requirements of ALADIN. Decentralization is achieved by solving 
the coordination step by means of decentralized inner algorithms. To this end, 
we propose a decentralized variant of ADMM and, as an alternative, a novel 
decentralized variant of the conjugate gradient algorithm for solving the coor- 
dination step of ALADIN. Decentralized conjugate gradient has the advantage 
of converging in a finite number of iterations, while ADMM achieves at most 
linear convergence. We derive both algorithms based on a new sparsity encod- 
ing technique. As these two inner algorithms are both iterative in nature, they 
introduce inexactness to the coordination step of bi-level ALADIN. We show 
that bi-level ALADIN, nonetheless, preserves the fast convergence guarantees 
of basic ALADIN if the inaccuracy in the coordination step of ALADIN decays 
fast enough. 


In summary, the contributions of Chapter 4 are 


— one of the first decentralized optimization frameworks for problems with 
non-convex constraints (bi-level ALADIN); 


— a reduced-space variant of ALADIN, lowering its communication and 
coordination requirements; 


— adecentralized inner ADMM algorithm tailored to the coordination step 
of bi-level ALADIN; 


— anew decentralized inner conjugate gradient algorithm solving the co- 
ordination step of ALADIN in a finite number of iterations; 


— anew sparsity-encoding technique unifying the derivation of decentral- 
ized inner algorithms; 


— aconvergence analysis of bi-level ALADIN under inexact coordination. 


The results of this chapter appeared in [Eng+20]. The sparsity encoding 
technique is not published so far. 


Chapter 5—Application to Power Systems 


Chapter 5 applies ALADIN and bi-level ALADIN to Optimal Power Flow 
(OPF) problems—an important problem class from power systems. We start 
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by reviewing relevant literature on distributed OPF and conclude that state-of- 
the art algorithms such as ADMM often lack convergence guarantees and often 
exhibits slow convergence. As an alternative, we propose ALADIN variants 
and compare ALADIN, condensed ALADIN and bi-level ALADIN to ADMM 
as a state-of-the-art algorithms for distributed OPF. We provide a numerical 
comparison in terms of convergence, coordination and communication for 
practically relevant cases. Moreover, we show that ADMM leads to feasible 
but not necessarily to optimal solutions in case of high penalization parameters 
in combination with a feasible initial guess. This is particularly important in 
the context of OPF as the combination of these two assumptions is sometimes 
used in the distributed OPF literature. 


In summary, the contributions of Chapter 4 are 


— one of the first algorithms for solving distributed OPF problems with 
guarantees; 


— a performance comparison of basic ALADIN, bi-level ALADIN to 
ADMM as prominent state-of-the-art algorithm; 


— a mathematical and numerical comparison in terms of communication, 
coordination and convergence of the above algorithms; 


— amathematical and numerical convergence analysis of ADMM for high 
penalization parameters in combination with a feasible initial guess; 


— an analysis of reasons for limited accuracy in case of using bi-level 
ALADIN with ADMM as inner algorithm. 


The results of this chapter have been published in [Eng-- 19b; Eng+17; Eng+20; 
EF18], where [Eng+19b] mainly focuses on numerical tests on cases up to 300 
buses. [EF18] analyzes the convergence of ADMM in case of high penalization 
parameters in combination with a feasible initial point. 


Chapter 6—The ALADIN-a toolbox 


Chapter 6 presents an open-source toolbox, ALADIN-a, implementing AL- 
ADIN, bi-level ALADIN and the Alternating Direction Method of Multipliers 
(ADMM) in a unified framework. This toolbox is designed to be modular hav- 
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ing rapid-prototyping of distributed and decentralized optimization algorithms 
in mind. ALADIN-a supports advanced features such as parallel computing 
and parametric programming. The toolbox comes with a rich set of examples 
from different engineering fields highlighting its broad applicability. Specifi- 
cally, we illustrate possible applications of ALADIN-a beyond power systems 
on a numerical example from distributed optimal control. ALADIN-a is one 
of the first toolboxes for decentralized non-convex optimization. 


More explicitly, the contributions of ALADIN-a can be summarized as 


— one of the first toolboxes for distributed and decentralized optimization 
with non-convex constraints; 


— a modular framework for rapid-prototyping of distributed and decentral- 
ized optimization; 


— a unified interface making comparisons to state-of-the-art algorithms 
such as ADMM easy; 


— a wide range of numerical application examples from optimal control, 
power systems, sensor networks, machine learning and robotics. 


The results of this chapter have been submitted for publication in [Eng+20]. 


Chapter 7 summarizes this thesis and proposes promising directions of future 
work. 


2 Basics of Distributed Optimization 


This chapter briefly reviews basics of nonlinear programming, where f and 
X in problem (1.1) are described by continuously differentiable functions. 
We also describe widely-used distributed optimization algorithms, where the 
derivation of dual decomposition is based on [Boy+11] and [BT89]. The 
derivation of ADMM comes from [Hou+17] with the difference that here we 
explicitly consider equality constraints. The optimality conditions are mainly 
from [NWO6] and the illustrating examples in Section 3.4 are novel. 


2.1 Basics of nonlinear programming 


Nonlinear programs (NLPs) are special cases of (1.1) in form of 


min f(x) (2.1a) 
xE 
subjectto x € & = (x € R™ | g(x) 20, h(x) x 0) (2.1b) 


where g : R"x — R"s and h : R” — R™ are called equality and inequality 
constraints and x € R” is called the decision vector of (2. 1).! Here, f, g and h 
are assumed to be sufficiently often continuously differentiable. Let us endow 
the constraint set X with a norm ||- || yielding a normed space (X, || - ||). Then 
we can define minimizers to problem (2.1) as follows. 


Definition 1 (Minimizers) We call x* € X a local minimizer to problem (1.1), 
if f(x*) < f(x) forall x € 8,(x*) NX with 8,(x) = (x | |x-x|| < r},r > O. 
If f(x*) € f(x) for all x € X, we call x* a global minimizer to problem (1.1). 


! Other special cases are integer or mixed-integer problems for example, where X C R"x x Z?z; 
or infinite-dimensional problems, where X is a subset of a function space. We refer to [Gröl8] 
for an overview many important classes of practical relevance. 


2 Basics of Distributed Optimization 


If the above inequalities hold strictly, we call x* a strict local minimizer respec- 
tively strict global minimizer. We call A(x) := (i € (1,..., ng) | h;(x) = 0) 
the set of active or binding inequality constraints at a point x. We call prob- 
lem (2.1) feasible, if X is non-empty. 


2.1.1 Optimality conditions 


How can one find minimizers to (2.1)? One approach is to derive opti- 
mality conditions which (when solved) provide solution candidates to (2.1). 
Many numerical algorithms rely on the first-order necessary Karush-Kuhn- 
Tucker (KKT) conditions, which are a set of non-linear equations and non- 
linear inequalities providing candidate solution points to (2.1). To ensure that 
the KKT consitions hold true at a minimizer, one relies on constraint qual- 
ifications. Here we recall the strongest constraint qualification—the /inear 
independence constraint qualification (LICQ).? 


Definition 2 (Linear independence constraint qualification) Let x be a feasible 
point to (2.1). LICQ is said to hold at x, if Vh;(x) and Vg;(X) are linearly 
independent for all i € A(x) and for all i € {1,..., ng}. 


Now, we are ready to state the KKT conditions. 


? Weaker constraint qualifications exist, for example the Mangasarian-Fromovitz Constraint Qual- 
ification (MFCQ) or the Abadie Constraint Qualification. Moreover, there are first-order op- 
timality conditions which do not require constraint qualifications, for example the Fritz-John 
optimality conditions [GGT04; BSS13]. However, these conditions are more difficult to evalu- 
ate and thus we stick with the KKT conditions and LICQ which is a common approach in the 
literature. 
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Theorem 1 (First-order optimality KKT conditions) Let f, g and h be contin- 
uously differentiable and let x* be a minimizer to (2.1) for which LICQ holds. 
Then there exist unique vectors y* € R"s and u* € R"^ such that 


0-2 Vf(x*)4 >> yt Vgi(x*) + = urVhi(x*), (2a) 


ie (1... ng) jell...nn} 
des, (2.2b) 
0> h(x*), (2.2) 
g* 3:0; (2.24) 
O=pth(x*) forall ie {i,...,mp}. (2.2e) 


We call (x*, y*, u^) a KKT point, if it satisfies the KKT conditions (2.2). 


Next, we recall sufficient conditions for optimality. Consider the Lagrangian 
function to problem (2.1) 


L(x, y. u) = f(x) + y'g(x) + u h(x), 
where y! = (yı,.. -> Yng) and a = us. Any): 


Theorem 2 (Second-order sufficient condition [NWO6, Thm 12.6]) Let x* be 
a KKT point where LICQ holds. Morover, let V2,  (x*, y*, u*) be positive 
definite on the subspace (x € R"* | Vg(x*)x = 0). Then, x* is a strict local 
minimizer to problem (2. 1)3 

Here, we denote the Jacobian matrix of gas Vg(x) = (Vgi(x)..... Vg, ©)". 
Many numerical algorithms rely on solving (2.2) for obtaining solution can- 
didates to (2.1). The second-order sufficient conditions can then be used 
to distinguish minimizers from maximizers and saddle points. This forms 
the basis for two of the most popular algorithms for nonlinear programming: 
Sequential Quadratic Programming (SQP) and Interior Point (IP) methods. 


3 Note that the subspace assumption {x € R"* | Vg(x*)x = 0} is slightly stronger than the 
one in the literature, where this subspace can be further reduced by means of the inequality 
constraint components h; which are active at x*. However, this would make the presentation 
much more technical and would not give much additional generality for our purposes. For a 
more general version we refer to [NW06, Thm 12.6]. 
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Observe that in the special case of equality constraints only, solving (2.2) 
reduces to solving a nonlinear system of equations which can be solved via 
standard Newton methods. 


Convex problems 


An important class of problems are convex optimization problems. Hence, let 
us recall basics of convex optimization. 


Definition 3 (Convex set) A set X € R"* is called convex if the line between 
any two points xj, x» € X lies entirely in X, i.e. 6x; + (1 — 0) x2 € X for all 
0 €(0 1). 


If the above inequality holds strictly for all xj, x2 € X, xı # x2, X is called 
strictly convex. 


Definition 4 (Convex function) A function f : X > R is called convex if its 
epigraph is convex, i.e. if for all xj,x» € X, we have f(8 xı + (1 — 0) x2) € 
0 f (x1) + (1 — 0) f(x2) for all 0 € (0 1). 


If the above inequality holds strictly for all x1 # x2, f is called strictly convex. 


Definition 5 (Strongly convex function) If f is differentiable and (V f(x1) — 
Vf(x2)) Ga -%) 2 u [lxi — xli? holds for all x1, x» € X, f is called strongly 
convex with parameter u. 


We denote the set of convex functions with S and the set of strongly convex 
functions with parameter u with S,,. Note that any strongly convex function 
is strictly convex but not vice versa. 


Convex optimization problems are special cases of (1.1), where f and X are 
convex. In case of nonlinear programming problems (2.1), this means that the 
inequality constraints h are convex and the equality constraints g are affine 
[Boy+04]. A nice property of convex problems is that their minima f(x*) are 
unique (if they exist). Moreover, if f is strongly convex also the minimizer x* 
is unique. In addition, the KKT conditions (2.2) are necessary and sufficient 
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for convex problems and thus one can omit the evaluation of the conditions 
of Theorem 2. Furthermore, specialized algorithms for convex optimization 
problems exist where some of them can be solved in polynomial time. Although 
we focus on non-convex optimization in this work, we will use these results 
from convex optimization in some of our derivations. 


Next, we recall some basics from convex optimization which are essential to 
our later developments. In convex optimization, the dual problem (which is 
a maximization problem) to a (primal) minimization optimization problem is 
often useful since one can recover the solution to the primal problem from the 
solution to the dual problem under certain conditions. In many cases, the dual 
problem is considerable easier to solve than the primal problem—thus many 
algorithms focus on solving the dual problem. Recovering the solution to the 
primal problem from the dual is possible particularly in absence of a duality 
gap, for which we recall conditions next. 


Consider the partial Lagrangian of (2.1) with respect to g, L(x, y) = f(x) + 
y! g(x). Then, the dual function to (2.1) is defined as 


q(y) := inf L(x, y) = inf f(x) € y 'g(x) Q.3) 
xeX xeX 
with X := (x € R"* | h(x) x 0). Moreover, the dual problem to (2.1) is 


sup q(y). (2.4) 
Y 
This leads to the following theorem. 


Theorem 3 (Strong duality under Slater's condition [Ber99, Props 5.1.4, 5.1.5, 
5.3.1]) Let f be convex on X and let g and h be convex, let (2.1) be feasible. 
Moreover, let some point x with h(x) « 0 exist (Slater’s condition). Then, 


sup q(y) = f(x), 
Y 


inf 
g(x)20,h(x) x0 


i.e. there is no duality gap between (2.1) and (2.4). Moreover, the set of 
maximizers to (2.4), D, is the same as the set of dual solutions to (2.1) and the 
set of primal solutions is given by the minimizers to 


min L(x, y), y* ED. (2.5) 
XE 
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Roughly speaking, this means that if the problem at hand is convex and Slater's 
condition holds, one can "replace" the original problem with it's dual. After 
solving the dual problem, a primal solution x* can be recovered by means of 
(2.5). 


Remark I (Slater's condition) Note that Slater assumption is standard in convex 
optimization and not overly restrictive. For special problem classes moreover, 
the assumptions can be relaxed. If h is affine for example, the strict feasibility 
condition can be relaxed to feasibility [Ber99, Prop 5.2.1]. Thus for feasible and 
convex quadratic programs for example, Slater's condition is always satisfied. 


2.1.2 Sequential quadratic programming 


There are four main branches for solving non-linear and possibly non-convex 
optimization problems in form of (2.1): 


* gradient-based methods; 

* sequential quadratic programming (SQP); 
* interior point (IP) methods; 

* augmented Lagrangian methods; 


and combinations thereof. Gradient-based methods are typically used for 
large-scale problems, where second-order information is too expensive to eval- 
uate. SQP and IP methods are very successful and exhibit fast convergence 
to local minimizers in many cases. However, each iteration is more costly 
compared with gradient-based methods. Augmented Lagrangian methods are 
often used for large-scale problems and as a building block for other algorithms 
(also within SQP). Here, we review the fundamental basics of SQP and aug- 
mented Lagrangian methods, because both are essential for understanding of 
the Augmented Lagrangian Alternating Direction Inexact Newton (ALADIN) 
algorithm. Gradient methods and interior point methods are not considered 
here—instead we refer to the textbooks [NW06; Ber99; Wri97; Nes13] for 
details. 


2.1 Basics of nonlinear programming 


SQP algorithms can be derived in two different ways. One is based on the 
idea of solving (2.1) by sequentially approximating the problem by a quadratic 
program (QP) where its name comes from. We recall the second way here, 
viewing SQP as solving the optimality conditions (2.2) by a Newton-type 
method. In absence of inequality constraints, the KKT conditions (2.2) for 
problem (2.1) read 


(2.6) 


an= pus : pue Es 


g(x) 


To solve this problem (and thus finding KKT points), let us apply a Newton-type 
method defined via the iteration 


gt! = qk - (M*)'F(q), (2.7) 


where q* := (x*, y^) and M* is a non-singular approximation of VF(g*). The 
Newton-type iteration (2.7) applied to (2.6) yields 


Bk Ve(x*)" xktl_ yk u V f (x^) + y Vg(x*) Q 8) 
Vg(x*) 0 yy g(x*) Í l 


where B* is an approximation of V? £(x*, y*).4 In the next paragraph we will 
recall that if an SQP algorithm is initialized close enough to a local minimizer 
(x*, y*), the pair (x, y) will converge to (x*, y^) at locally quadratic rate 
provided that V2 | £(x, y) is positive definite and LICQ holds at (x*, y*). SQP 
can be extended to problems with inequalities—we refer to [NW06, Chap. 18] 
for details. 


Local convergence of Newton-type methods 


Next, we give a proof of local convergence and convergence rates for the above 
Newton iteration (2.8). We consider approximations of the KKT matrix M* 
instead of exact ones. These approximations are important for example if one 


^ Positive definiteness of B^ is essential for local convergence of Newton-type methods. These 
methods differ in the way they compute B, cf. [NW06, Ch 3.4]. 
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would like to use Hessian approximations B^ x V2, £(x*, y^) instead of exact 
Hessians V2, £(x*, y*) when using Newton-type methods in context of SQP. 
When doing so, the local convergence rate depends also on the "quality" of 
these approximations. Local convergence of the iteration (2.8) is guaranteed 
by the following theorem. 


Theorem 4 (Local convergence of Newton-type methods [Diel6, Thm 8.1]) 
Suppose that F(q*) = 0 holds. Moreover, suppose that g* is initialized close- 
enough to q*, ie. with ||g* — q*|| < um where y and w are defined 
below. Then, the iterates generated by (2.8) converge to q* and satisfy the 
convergence-rate estimate 


QE, ; 
la! = e*ll < (x+ Z1la* — ali) a“ - 4*1 (2.9) 
if there exists an w < œ and a x < 1 such that 
(M ^! (m* - VF(q*))|| < Kë < x, (2.102) 


loro (veo - vr )| s Zh -al. 210% 


Here the former inequality (2.102) is called compatibility condition and the lat- 
ter inequality (2.10b) is a Lipschitz condition. The proof of Theorem 4 is given 
in Appendix A.4 for the sake of completeness and because of its importance 
for the convergence analysis of ALADIN. Next, we provide examples for local 
convergence rates to provide some intuition on the effect of the two conditions 
(2.102) and (2.10b) in Theorem 4. 


Example 1 (Convergence rates of Newton-type methods) Assume that VF is 
Lipscitz continuous. Then, if we choose a non-singular M*, the Lipschitz 
condition (2.10b) is always satisfied. 


For checking the compatibility condition (2.102), consider three variants of 
Jacobian approximations M*: exact Jacobians with M* = VF(g*); damped 
Newton with M* = VF(q*) + ôI for some 6 > 0; and damped Newton with 
vanishing damping M* = VF(q*) + 107**7 for some c > 0. For the first case, 
it immediately follows by (2.102) that x = 0. Thus, we have local quadratic 
convergence by (2.9) (cf. Section A.1 for a brief overview on convergence rate 
estimates). 
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10° 10° 
100 10? 
0 5 10 15 20 0 5 10 15 20 
k k 
(a) Typical convergence rates. (b) Linear conv. faster than superlinear. 


Figure 2.1: Pure Newton (blue), damped-Newton (red) and damped-Newton with vanishing damp- 
ing (yellow) for solving sin(x) = 0. 


For damped Newton, condition (2.102) reads ||(VF(g*) + 61) (VF(q^) + 
ôI- VF(q*))|| = (VF (q*) + ó1)-!| ô = K, where one can make x arbitrarily 
small by choosing a small 6 if VF(q*) is non-singular. This yields linear 
convergence with arbitrary fast modulus «x. 


For damped Newton with vanishing damping term we have «^ := ||(VF(q*) + 
107** r)-! ||107** converging to zero for k > œ. Thus, if VF(q^) is non- 
singular, one obtains superlinear convergence. 


Example 2 (Linear convergence is not necessarily slow) Let us consider 
the problem of solving F(x) = sin(x) = 0. Figure 2.la shows the con- 
vergence of pure Newton's method (blue), damped Newton's method (red) 
with ó € (0.8, 2) and damped-Newton with vanishing damping (yellow) and 
c € (0.3, 0.1). Onecan clearly see that by choosing appropriate ó and c, one 
can make convergence of the damped Newton variants almost as fast as the 
pure Newton's method. Thus, linear convergence does not necessarily mean 
that convergence is slow—it largely depends on the convergence modulus x. 
This is especially relevant since damped-Newton methods for example typi- 
cally converge at a quite fast linear rate if the damping parameter is sufficiently 
small. First-order methods such as ADMM, which we will introduce later, 
can be quite slow but both algorithms are guaranteed to converge linearly for 
certain problem classes. 


Remark 2 (Singular V F(q*)) Note that if V F(q*) is singular, the convergence 
rate is in general linear only—also with exact Jacobians M* = V F(q^). In that 


case, the Lipschitz condition (2. 10b) is violated, since ||( V F(q^)) ^! || — co for 
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q* — q*. Thus, the second last step of the proof of Theorem 4 (Appendix A.4) 
can not be performed. In this case we only know that by continuity of VF, 
ie | mM)! (VF (q* + t(q* — q*)) - VF(q*))|| dr is bounded yielding linear 
convergence. As an example consider the problem F(x) = x? = 0 with x* = 0. 
Then the Newton iteration reads x**! = x* — (x*)?/(2x*) = x*/2 and thus by 
taking norms ||x**! —0|| < 1/2 ||x* —0||. Notice that in this case, regularization 
procedures have to be considered as VF becomes increasingly ill conditioned 
close to q* and thus factorization of M* becomes problematic. 


Remark 3 (Local vs. global convergence) Note that one distinguishes between 
local and global convergence in optimization algorithms. Local convergence 
characterizes the convergence behavior of algorithms when initialized close 
to a local minimizer. In order to reach this region of local convergence, typi- 
cally globalization routines ar required. These routines take aim at driving the 
iterates into the region of local convergence and should then typically “deac- 
tivate”. However, note that although one might guess that if an algorithm is 
globally convergent, it converges to a local minimizer from any initial point. 
This is unfortunately not the case. Typically it is only guaranteed to converge 
to a stationary point of a so-called merit function, which can for example be 
the augmented Lagrangian. Unfortunately, not all stationary points of the aug- 
mented Lagrangian are local minimizers [Ber99, Chap 4.3]. Especially in case 
of non-convex constraints, the situation of linearly dependent constraint lin- 
earizations can occur contradicting the assumption of linear independence of 
the constraints sometimes made for the global convergence analysis of nonlin- 
ear programming algorithms [BT95, Sec 4]. We give a concrete example what 
kind of difficulties can occur especially in context of non-convex constraints 
in Section 2.3.3. 


2.1.3 Multiplier methods 


Instead of working on the optimality conditions (2.2) directly and updating 
primal and dual variables in a simultaneous step, there is a different class 
of algorithms called multiplier methods [Ber82; Ber99]. There, the idea 
is to update primal and dual variables separately. This has advantages in 
several contexts—especially for developing distributed algorithms as in some 
cases the primal updates can be done in a distributed fashion under certain 
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conditions. There are two basic approaches in this class of algorithms: one 
working on the Lagrangian function and a second one working on the so-called 
augmented Lagrangian function which adds some kind of regularization term 
to the objective function without changing the minimizer. The augmented 
Lagrangian for (2.1) (in case of equality constraints only) is defined as 


LP (x,y) = f(x) +" g(x) + AROR 


Note that for p = 0, one recovers the standard Lagrangian. The method of mul- 
tipliers now proceeds in two simple steps: First, we estimate a certain Lagrange 
multiplier A*. Then we minimize the Lagrangian/augmented Lagrangian with 
respect to x in a separate step. The resulting algorithm, summarized in AI- 
gorithm 1, is guaranteed to converge to a minimizer of (2.1) if the multiplier 
estimates are close enough to A*. If we would estimate the multipliers cor- 
rectly, multiplier methods converge in one step if p is large enough, cf. [Ber99, 
Sec 3.2.1]. Moreover, multiplier methods converge to a global minimizer of 
(2.1) for p — œ [Ber99, Prop 4.2.1]. However, note that although this result 
is very appealing from an theoretical point of view, in practice driving p to 
co is problematic due to the corresponding ill-conditioning of V? £? (x, y) and 
the necessity to minimize £P globally which is hard to achieve by practical 
solvers. If we choose the multiplier update to 


Algorithm 1. General multiplier method 


Initialization: y, p 
Repeat: 
1) x**! = argmin LP (x, y^) 
x 


2) y" c yk 


y^ - yk + ck g(x*), (2.11) 


the method is called method of multipliers. 
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2.2 Distributed optimization algorithms 


Now let us consider optimization problems with special structure leading to 
distributed optimization algorithms. Note that in the literature various forms 
of such structures are employed [ND11]. In many cases, they can be converted 
into each other. We give a brief overview on these structures in Appendix A.2. 
Here, we consider problems in so-called affinely-coupled separable form 


min 2. ka) (2.12a) 
MER m 
subject to _g;(x;) = 0, VER, (2.12b) 
h;(x;) < 0, Vi € f, (2.12c) 
> Aixi = b, (2.12d) 
ie 
where R = (1,..., N} is the set of agents or subsystems, to each of which 


we assign an objective function f; : R=} — R, equality constraints g; : 
R'»* — R”si and inequality constraints h; : R! — R"^i, We assume that all 
fi, gi and h; are sufficiently often continuously differentiable. The agents are 
coupled trough (2.12d) which is called consensus or coupling constraint. Note 
that many practical problems in form of (2.1) can be reformulated in form of 
(2.12) by introducing auxilliary variables, cf. Appendix A.2. Next, we recall 
one of the earliest distributed optimization methods called dual decomposition 
[Eve63]. 


2.2.1 Dual decomposition 


To simplify notation, we use the indicator function ix : X > RU {oo}, 


. if xeX 
X Hr 


co else 
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in the subsequent analysis. With the indicator function we can reformulate 
(2.12) as 


min NICO (2.132) 
MD TT. 
subject to b3 Aix; = b, (2.13b) 
ie 


with f; := f; Lx, and X; := (x; € R"* | g;(x;) 2 0, h;(x;) € 0). As the name 
indicates, dual decomposition is closely related to solving the dual problem 
to (2.13). Let us assume that the assumptions made in Theorem 3 (convexity, 
Slater's condition) hold true for problem (2.13). Then we can replace (2.13) 
by its dual problem 


sup q(A) = sup min É (xi) +A" Aix; — b|. (2.14) 
uP q nn E > 


ER 


One very simple way to solve (2.14) is to apply a gradient method. If f is strictly 
convex and X is compact, one can show that q is continuously differentiable 
and its gradient is given by 


Vq(4) = > Aixi(a) - b, 


ief 


where x; (4) is computed by evaluating the dual function g [Ber99, Prop 6.1.1]. 
The gradient iteration for maximization of (2.14) then is 


AFP = aug Vati, (2.15) 


where a € (0 1) is astepsize parameter. Observe that q is separable, i.e we can 
write q as q(A) = Vier gi(A) 2 A" b, where q;(A) = miny,ex, (ax) +A Aixi, 
so for fixed A, q can be evaluated in parallel. The resulting gradient method is 
called dual decomposition and summarized in Algorithm 2, where L; (x;, A) := 
fi(xi) + ATA;x;. Note that dual decomposition is very effective when the 
minimization in the evaluation of the dual can be replaced by an explicit 
expression (e.g. when solving strictly convex QPs). 


An advantage of dual decomposition is that it is very simple to implement. 
Moreover, if the A;s are sparse (i.e. there are many A;s with zero rows), it 
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Algorithm 2. Dual decomposition 


Initialization: A9, {ax} 


Repeat: 
1) xF*1 = argmin £;(xi, A*) Vi € R (parallel) 
Xx; €i 
2) AF = Ak + a* (Die Aixk*! — b) (centralized) 


is possible to evaluate step 2) of Algorithm 2 in a decentralized fashion as 
computing the components of A* involves summands of the subsystems with 
nonzero rows in A; only. On the other hand, dual decomposition requires strict 
convexity of f and compactness of X, or strong convexity of f as otherwise the 
minimizer in step 1) might not exist.? In turn this means that for general convex 
problems, dual decomposition might fail. Moreover, as dual decomposition is 
a gradient method with a fixed stepsize, convergence is in general sublinear 
only [Nes18, Thm 2.1.14]. 


2.2.2 Alternating Direction Method of Multipliers 


To overcome its weaknesses (in particular the restriction to strictly/strongly 
convex problems), dual decomposition can be combined with the method of 
multipliers from Section 2.1.3 yielding the Alternating Directions Method of 
Multipliers (ADMM). Herein, the idea is to use the augmented Lagrangian (in- 
stead of the Lagrangian) adding an additional convexification term rendering 
this algorithm applicable to convex but not necessarily strictly convex prob- 
lems. Unfortunately the augmentation term in the first step of the method of 
multipliers jeopardizes the separability of £ hindering evaluation of the dual 
function in parallel. To overcome this issue, one usually constructs an aug- 
mented problem to (2. 13a) introducing variable copies z of x and then executes 


5 This can be the case even for very simple problems. For example, consider a problem with affine 
cost. Then the problem in step 1) might be unbounded from below and dual decomposition 
would fail. 

6 Asan alternative, one can also see ADMM as a dual ascent method, where one solves a modified 
but equivalent version of (2.13a) with a penalization term p/2|| X jeg Ai xi — b|| in the objective 
[Boy+11]. 


22 


2.2 Distributed optimization algorithms 


the first step of the method of multipliers (Section 2.1.3) in an "alternating" or 
coordinate descent fashion recovering partial separability. We will now briefly 
recall ADMM for the augmented problem and then show how our problem 
(2.132) fits into this setting. 


The augmented problem in so called two-block form reads 


min f(x) + g(z) (2.16) 
subject to Cx + Dz = c. (2.17) 


For applying the method of multipliers, we consider the augmented Lagrangian 
£?(x,2,4) = f(x) + g( ) + AT (Cx + Dz- c) + 5 ICx+Dz-cli3. (2.18) 


We apply the method of multipliers (Algorithm 1) to problem (2.16) with 
one key difference: instead of performing a joint minimization with respect 
to x and z in step 1), we perform the minimization with respect to x and with 
respect to z in two separate steps. This yields ADMM shown in Algorithm 3. 


Algorithm 3. Alternating Direction of Multipliers Method (ADMM) — 
Initialization: z2, A9, p 
Repeat: 

1) x**! = argmin, £P (x ,z*, a*) 

2) zktl = argmin, £P (x+, z ,a*) 


3) k+l = ak + o(Cxk*! + Dzkt! - c) 


Now consider problem (2.13a). Let us write (2.13a) in two-block form by 
introducing variable copies z; = x; for all i € R yielding 


min = FE) + e(z) (2.19a) 
Xp XRSELbe ZR IER 
subjectto A;(x; — zi) 20 VER, (2.19b) 
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where C := {z € R'* | Xieg Aiz; = b). Then, the augmented La- 
grangian (2.18) reads 


L£P(x, 2,4) = X fil) tie) + AL Ai - 2) + MA Qi - zB, 2.20) 
ies 

where A" = (Aj,..., Ay). Now it becomes clear why introducing additional 

auxiliary variables and alternating minimization helps: if one fixes z in (2.20) 

and minimize with respect to x, the minimizations with respect to x; are 


independent of each other and can be executed in parallel. This yields a parallel 
version of ADMM for problem (2.13a) shown in Algorithm 4. Note that the 


Algorithm 4. Parallel ADMM for problem (2.13) 
Initialization: g. ao for alli € R, p 


Repeat: 
1) ge =argmin fj(x;) + AF! Aix; + BIA (xi — £P). IER (parallel) 
xiEXi = 
2) z**! = argmin Der -AF! Aizi + EJA ak! - z)l (centralized) 
zec 
3) ay m ak + pA; (xh! x a), i ER (parallel) 


“coordination step" minimizing £? with respect to z is a convex QP which 
requires central computation. However, this step can be simplified to a simple 
averaging step between subsystems under certain conditions [Boy+11]. This 
renders ADMM a decentralized algorithm which can be executed purely based 
on neighbor-to-neighbor communication. On the other hand, the multiplier 
update step is also executable in parallel. Note that there are further specialized 
variants of ADMM mostly differing in the underlying problem structures. 
Many of them can be reformulated in two-block form, we provide several 
examples in Appendix A.2. 


Due to the augmentation term, if either X is bounded or A has full row 
rank, ADMM is guaranteed to converge also for non-strictly convex problems 
[BT89, Prop 4.2] [Boy+11, App A]. The convergence rate, however, depends 
on strict convexity of f and does not directly carry over from the method of 
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multipliers.’ If f is convex, convergence of ADMM is generally sublinear 


[HY15]. Under certain conditions such as strong convexity [DY16; GB16; 
Shi+14] or acombination of other conditions including strict convexity [HL17], 
linear convergence can be achieved.? 


Remark 4 (ADMM variants) There are various variants and derivations of 
ADMM differing in the order of the updates as well as in the problem setup. 
Moreover, multi-block variants of ADMM exist, where more than two blocks 
of variables are considered [HL17; LMZ15]. The overview paper [Boy+11] 
provides an excellent overview on the most common form of ADMM. 


To illustrate that distributed optimization methods can be very useful in com- 
puter science and especially in machine learning, we present an application 
example of ADMM applied to support-vector machines used for classification 
problems. 


Example 3 (ADMM for machine learning: support-vector machines) A stan- 
dard method for classification are Support-Vector Machines (SVMs) aiming at 
separating sets of points by a hyperplane. For SVMs, distributed optimization 
might be interesting primarily in case of very large data sets. One reason is 
computational speedup, but also data privacy is a reasons why one would like 
to compute on multiple machines in case of sensitive data [FCG10]. An linear 
SVM (for two classes) can be cast as the optimization problem 


: ô 
min 5, (1—cjQwy; - b)), + zlhwlls. (2.21) 
w,b toy 9 


where (x), := max(0,x), y; € R™ are given data points collected in the set 
Y = (L..., Ny} [CL11]. These points are classified into two groups labeled 
with o; € (-1, 1). The optimization variables w € R° and b € R define the 
affine subspace separating the two groups of points. 


7 One explanation for this is the fact that the convergence rate estimates of the method of multipliers 
depends on the exact minimization of £L? jointly with respect to x and z. In ADMM, however, 
one executes a form of block-coordinate descent yielding ane inexact minimization. 

8 Moreover, ADMM is convergent for special classes of non-convex problems, cf. [WYZ19; 
HLRI16]. However, for general non-convex problems, there is—to the best of the author's 
knowledge— no convergence guarantee for convergence. 
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We would like to distribute computation among R = {1,...,R} proces- 
sors. For doing so, we partition the data points y; into R groups Y; with 
Uj=t,....v Yi = Y. Hence, problem (2.21) can be written as 


ô 
min g 1 -e;(wIy; - bi)), + =llwill?, 
{wi }ier, {bi}ier,a,b 2,2, zer ' ^ x 


subject to Wi =W, b; = b, pEAE.AR 


which is in form of (2.12)? The augmented Lagrangian then reads 


m ð 2 
LP {wihierstbihien wb) =D) 2, (=o Wy = bd), + 5lbeili 
ief je Mi 
T - z\,P -2 P L2 
* A; wi — W) + Api(b; — b) + zwi - wl|[5 + 5 ^i - bl. 


ADMM leads to the iterates 
. ô 
(WEHT, BI) = arg min 3^ (1- oW] yy- bi), + 5 llwall3 + Ags Qvi = 9) 


wi,Di 


je Xi 
+ Api (bi — b) + È (llw: - wl + [Ibi - BIB). TER. 


(pu. pr) = Ri Larn, prt. 
ie 


(AKHIT, gitly = (aT. ak) p (CP pF?) — (wT, pp) . o def. 
Note that the first (and by far most expensive) step of the above iterates can 
be implemented in parallel. Note that this first step essentially solves an 
independent SVM problem only considering the data only from Y; with two 
additional summands in the objective function repsresenting information form 
the other data sets by the global averages w and b. 


? Except for f; being non-smooth. However, ADMM can in general also be applied to non-smooth 
problems as the one considered here [Boy+1 1]. 

10 Note that the summands involving the multipliers Aw; and Ap; in the second step cancel out as 
it can be shown they are zero after the first iteration [Boy+11, Chap 7]. Moreover, observe that 
the parallel step can be implemented by a QP solver. 
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Figure 2.2: Convergence of the ADMM-based SVM. 
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Figure 2.3: Centralized and distributed SVM result. 


This implementation is especially useful for large datasets: in this case, the data 
can be distributed equally among all processors by choosing the same carnality 
for each Y;. This way, each subsystem needs approximately the same time to 
solve its local SVM. By doing so, one can (theoretically) keep the execution 
time constant when increasing the number of processors simultaneously with 
the increase in the number of data points. However, keep in mind that such 
decomposition makes sense only for large problems and when using multiple 
processors, or in case privacy is of concern. For small datasets standard 
centralized convex QP algorithms (e.g. interior point methods) will usually 
outperform ADMM. 
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To test the ADMM-based SVM, we consider a set of 2,000 data points dis- 
tributed equally among R = 10 processors, where r% = = plo, pet!) - 
(FT, bE)) is the dual residual, i.e. the degree of stationarity in the KKT 
conditions (2.2a). We use p = 6 = 1 as ADMM and SVM parameters, cf. 
[Boy+11, Chap 3.3]. The datasets are partitioned in the worst-possible way to 
exclude good solutions by chance: we choose 5 subsystems with all positive 
datapoints and 5 subsystems with all negative datapoints. Figure 2.2 shows the 
convergence of the ADMM-based SVM. One can see that ADMM converges 
quite slowly and only to a limited accuracy. However, Figure 2.2 depicts the 
resulting hyperplane computed by the centralized and the distributed SVM. 
The two hyperplanes are almost indistinguishable. This is typical for ADMM: 
it converges to medium accuracy in a hand-full iterations and then slows 
down. However, as for many applications this level of accuracy is sufficient— 
especially in context of machine learning—ADMM seems to be an excellent 
choice for these applications. 


2.2.3 Basic ALADIN 


A fundamental limitation of dual decomposition and ADMM are their con- 
vergence guarantees only for convex or strictly convex problems. One of the 
very few algorithms for non-convex problems is the Augemented Lagrangian 
Alternating Direction Inexact Newton (ALADIN) algorithm [HFD16]. A main 
advantage of ALADIN is its local quadratic convergence rate and local con- 
vergence guarantees also for non-convex problems. 


In contrast to ADMM, ALADIN does not introduce auxiliary variables z 
into the problem formulation. It directly deals with the partial augmented 
Lagrangian of (2.12) with respect to (2.12d) 


LP (x, A) = fa) +x, uu [Dyn - jj = 


iER ie 


2 
YA x;-b|. (2.22) 


ie 
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Consider the method of multipliers, but instead of applying a full minimization 
of LP with respect to x we apply only one equality-constrained SQP step 
yielding 


. 1 Tok T kT 
min 5 54%; Bj Ax; + Vf, (Gxi)Axi + À X Ari + Ax) - b 


ie ie 
2 
" S Aiit Ar) -b 2.23) 
ie 2 


subject to g;(x*) + Vg;(xF) Ax; = 0, VER, 


where we combine equality constraints and active inequality constraints in 
£i)" = (gi(xi)", (tii) ) a, ). Here, the matrix B* isa positive definite 
approximation of the Hessian of the full Lagrangian, i.e. B* R Vis (Cfi (xk) + 
ys gi(x*) + u7 h;(x*)). For the multiplier update we apply the standard dual 
ascent step from the method of multipliers (2.11) 


AS =f au" » Aut! — | . 


ie 


In principle, one could apply this algorithm now to equality-constrained prob- 
lems. This would yield a very effective algorithm since if p is large enough, 
the QP (2.23) becomes strongly convex and thus it can be replaced by solving 
the KKT conditions (2.2) which is a linear system of equations. However, if 
inequality constraints are present, the question arises how to obtain the active 
set A(x*). An alternative is to consider inequality constraints in (2.23), but 
this would make (2.23) substantially more difficult to solve, since the KKT 
conditions (2.2) also entail inequality constraints in this case. 


ALADIN uses a different approach: it introduces a local NLP step very similar 
to step 1) of ADMM (Algorithm 4). This step reads 


min $^ fix) + A Aix + z lxi — zt. (2.24) 
"TER 


where 2; is a (usually diagonal) positive definite scaling matrix and where 
we introduce auxiliary variables z^ serving as a second iterate in ALADIN. 
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As an active set for (2.23), one can use the active set from the minimization 
of problem (2.24). This has additionally the charm that, if derivative-based 
numerical solvers are used to solve (2.24), one can reuse these derivatives for 
setting up (2.23). Note that (2.24) is separable, i.e. the minimization can be 
parallelized. Combining the above yields the ALADIN algorithm summarized 
in Algorithm 5. Note that 2; (xk) = 0 in Algorithm 5 as feasibility is ensured 
in step 1) and that Vg is the block-diagonal concatenation of all Vg;. 


Algorithm 5. Basic ALADIN 


Initialization: Zz, 29, Xj > 0 for alli € R, v, p 
Repeat: 


1) Solve for allie R 


: . v . 
xk = argmin fi(xj) + AF Aixi + 3 lx; — <i Ils. (parallel) 
xi€Xi 


2) Compute V f, (x4), BF = Vix, (f Gb +77 s GP e ud he), vi. 
3) Solve the coordination QP 
Ax* = argmin Die 1Ax7 BE Ax; + V f? (xk) Ax; (centralized) 
Ax 
: p 
+ AT (Zier Ailxk Ax) - b) + I Zier (GE + Axi) - bil? 
subject to Vg(x )Ax =0. 


4) Set zit! = xk + Ark and at! = ak + p (Zier Aixi - b). (parallel) 


Remark 5 (Slack variables for numerical stability) In [HFD16], the QP step 
(2.23) is replaced by the equivalent formulation 


1 
min ) zAx] B An + Vf Gf) Ax eA d + ALIE 
ief 


Ax,d 

subjectto g;(xf)- Vgj(xf)Ax; 20 VIER, (2.25) 
SAGE + Ax) bad AP, 
ie 
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where d € R"« are slack variables. These slack variables are important for 
numerical stability of ALADIN in implementations. For the understanding 
of ALADIN, this formulation does not add much, but due to their practical 
importance we will consider this formulation in the following. 


Remark 6 (Computation of A**!) Note that A**! coincides with AQ? from 
(2.25). This can be seen from the optimality conditions to (2.25), AP = 
AK + pd =A* + p (Die Aix; — D). 


Remark 7 (Globalization of ALADIN) Note that for global convergence, i.e. 
convergence from remote starting points, a globalization rountine is necessary. 
This means that step 3) of ALADIN is replaced by 
ze = zk + ay (x* — zb + a2 Ax*, AFP e dta aV e ay 

where the step-sizes @1,@2 and o3 are determined by these routines. The 
setting v, = a2 = a3 = | recovers step 3) of Algorithm 5, which is called the 
full step variant of ALADIN. One globalization routine is given in [HFD16], 
however it is highly centralized and computationally intense. In the present 
work we focus on local convergence, i.e. on convergence for initial guesses 
close enough to a local minimizer where a globalization routine can be omitted. 


2.3 What's wrong with constraints? 


In this subsection we will briefly comment on difficulties, which can arise in 
the application of distributed optimization algorithms to constrained problems. 
Constraints are important in many cases—particularly for applications in power 
systems and control which we have in mind. 


2.3.1 Alternating projections in ADMM and ALADIN 


Constrained problems might lead to difficulties in alternating direction methods 
such as ADMM. As outlined before, the core idea of ADMM is to minimize the 
augmented Lagrangian £L? with respect to two distinct variable blocks x and z in 
an alternating fashion to obtain separability. The name "alternating direction" 
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already suggests, ADMM is a kind of alternating projection method between 
the two types of constraints—local constraints and consensus constraints. This 
can be seen by the two different constraint sets X; and C in the minimization 
steps of Algorithm 4. These alternating projections might lead to very slow 
convergence of ADMM as we will see next. 


An example 


Let us consider a convex feasibility problem (i.e. f = 0) with a local constraint 
set X 


minix(x)  subjectto | Ax — b. (2.26) 


Then ADMM (Algorithm 3 simplifies to the following alternating projection 
procedure 


„er = Tx (z* E u*) 


SU = Het a uk) 

uk = uk 4 ht! ES QU 
where IIc : R"* — C denotes the orthogonal projection on the “consensus 
constraint set C = (x € R™ | Ax = b}, see [Boy+11, Sec 3.1.1]. Let us 
consider the 2-dimensional example 


C-(xeR^|x;20) and X-(xeR? Ix -žl < 77}, 


with x = (10 50.99)" and r = 51. The set of optimal solutions is given by 
X* -XnC - (x e R2 | (c 0)7,c e [899 11.01]}. 


Figure 2.4 depicts the two feasible sets X and C considered in step 1) and 
step 2) of ADMM with the corresponding iterates starting from the initial 
point z? = (0, 0.5)" and u? = (0, 0)7. One can clearly see the alternating 
projections in both algorithms: the iterates {x*} stay feasible in X and the 
iterates (z^) stay feasible in C—thus “alternating” between these two sets.!! 


!! For ALADIN, the iterates (z^) slightly deviate from C as the consensus constraints (2.12d) 
are considered in the penalization term of LP. 
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Figure 2.4: ALADIN and ADMM iterates for problem (2.26). 


For ADMM convergence is slow since pure projection between both sets makes 
almost no progress toward their intersection. For ALADIN, the situation is 
different since here also curvature information of the constraint set X (i.e. 
the constraint linearizations Vg(x^) in step 2) of Algorithm 5) is considered 
driving the iterates much faster to X N C. This can be seen from the step 
directions (x* — z^) being almost tangential to the boundary of X. 


Remark 8 (Description of feasible sets) Note that considering constraint lin- 
earizations in ALADIN comes at the cost of requiring the feasible set X to 
be described by differentiable constraint functions g and h, ie. X = (x € 
R” | g(x) 2 0, h(x) < 0} as in (2.1b). This can be seen as a restriction to the 
class of distributed problems, to which ALADIN is applicable. ADMM on the 
other hand does not have such a restriction and is thus able to handle a general 
convex constraint set. In view of the fact that the majority of practical applica- 
tions has constraint sets described by differentiable functions, not considering 
this information means to accept a unnecessary poor performance. 
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Figure 2.5: Infeasibility in C over the iteration index. 


2.3.2 Integrator wind-up in ADMM 


Apart from slow convergence from alternating projections, there seems to be a 
second effect in ADMM slowing down its performance. This effect is related to 
the multiplier update in step 3) of ADMM (Algorithm 4) acting as a integrator. 


Figure 2.5 shows the infeasibility of the iterates (x*) in terms of the constraint 
set C, i.e. [xk | over the iteration index k. Here one can see that not only 
ALADIN converges much faster than ADMM in terms of total iterations, but 
also that the accuracy in the constraint violation one can reach with ADMM 
seems to be limited to a level of around 10-7. This is surprising, as ADMM is 
shown to converge monotonically [HY 15]. 


Figure 2.6 also shows the the infeasibility of the iterates (x*) in terms of the 
constraint set C but for a much larger number of iterations and additionally 
also the scaled dual variables u*. Here, one can observe a very surprising 
effect: whereas in the primal space, there is no visible progress for more than 
600 iterations, the dual variables change slightly. After about 700 iterations, 
the primal variables converge suddenly and to a very high accuracy. 


By having a look at the dual variables, one can see why: they integrate to a 
value of about 6 in the first few iterations and then converge only very slowly 
to their optimal value of zero, which seems to hinder the primal values from 
convergence. 


One might ask why this is the case. From Figure 2.4 one can see that ADMM 
performs quite large alternating projection steps, which drive the dual variable 
to ahigh value in the first iterations, since the multiplier update rule “integrates” 
the distance (x**! — z**!). Once the z* iterates reached the feasible set, the 
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100 200 300 400 500 600 700 
Figure 2.6: Infeasibility in C and scaled dual variables. 


alternating behavior continues although the iterates z* are already feasible. As 
in this new area of oscillation (Figure 2.4, second plot), the distance (xttl -z4+1) 
is quite small, it takes several hundreds of iterations to dismount the before 
integrated value of the multipliers. Once they reached their optimal value, 
ADMM suddenly converges. Observe that for this example, the behavior is 
independent of p, since p does not appear in the ADMM iterations. Thus, this 
effect can not be overcome by tuning. 


To the best of our knowledge, this effect is not investigated in the literature in 
detail so far. Moreover, it seems to be one reason for very slow convergence 
of ADMM for some OPF problems (cf. Section 5.4.1). 


2.3.3 Non-convex constraints 


Many distributed algorithms require the feasible set to be convex. If one 
nonetheless applies these algorithms to problems with non-convex constraints, 
the convergence is not guaranteed. Thus, the question arises, whether there 
is a particular difficulty for distributed algorithms to handle non-convex con- 
straints. Next, we will show that this is not the case. More specifically, we will 
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show an example, where all here considered algorithms diverge—including 
centralized methods such as the method of multipliers and SQP diverge con- 
tradicting the common wisdom that they can achieve convergence to a local 
minimizer from an arbitrary initialization by considering suitable globalization 
routines. 


With this, we would like to emphasized that there is an inherent difficulty asso- 
ciated with certain kinds of non-convex constraints from which all algorithms 
suffer—not only the distributed ones. This should shape our expectation that 
it is hardly possible to design distributed algorithms which are able to handle 
general non-convex constraints. The best we can achieve also here is local 
convergence—i.e. convergence from an initial point close enough to a local 
minimizer. ! 


What's wrong with non-convex constraints? 


Consider the non-convex problem 


min 107? (Qı - 8)? + (x2 - p (2.27a) 
xeR? 
subject to x2 — sin(x;) =0 (2.27b) 
(0.15, -1)x =0. (2.27c) 
This problem can be written as 
min 1072 (e ~8)2 + (x - 1?) + ix, (x) +ix,(x) (2.28) 


with a non-convex constraint set X; = {x € R? | m — sin(x;) = 0} and 
a constraint set Xo = (x € R? | (0.15, -1)x = 0). The contour lines 
of the objective and the feasible are depicted in Figure 2.7. The feasible 
set consists of three points X = ((0,0)7, (2.72, 0.41)"}, one of which 
(x* = (2.72, 0.41) ") is globally optimal. 


12 One expectation are centralized global optimization algorithms such as a-branch-and-bound 
[Ste17]. However, these global optimization techniques are typically very expensive and often 
due to that not applicable to large problems. 
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Figure 2.7: Contour plots and feasible set of problem (2.27). 


Three of the four considered algorithms—ADMM, ALADIN and the method 
of multipliers—are based on minimizing the augmented Lagrangian which for 
A = 0 reads 


LP (x,0) = 107? (es -8)4(5- 1?) + ix, (x) + 5 IK0.15, -1) I2. 


The theory of augmented Lagrangian-based methods predicts that if we iter- 
atively minimize £? for a fixed multiplier A* and we let p — oo, then these 
methods will converge to the globally optimal solution to (2.27) regardless of 
the update rule for A* [Ber99, Prop. 4.2.1]. At first glance, this sounds very 
promising, but let us have a closer look on £? when increasing p. Figure 2.8 
shows the contour lines and the feasible set of the problem 


min LP (x, 0) (2.29) 


for p = 10°. One can clearly see that £? (x, 0) has two local minima in the plot- 
ted range: one at the global optimal solution to problem (2.27), (2.72, 0.41)" 
but also another one at (7.7, 0.98)" which is not even a feasible point. This 
highlights an inherent practical difficulty of augmented Lagrangian methods 
for non-convex constraints: if £? is iteratively minimized with local solvers 
(which is often the case in practice), then the method might converge to a point 
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(a) Top view. (b) Side view. 


Figure 2.8: Feasible set and contours of (2.29) for p = 108. 


which is neither feasible nor optimal.!? The premise of the before mentioned 
theorem is that L? is minimized globally, which can not be guaranteed here 
due to using local solvers. Consequently, this can be observed in ADMM, 
ALADIN and the method of multipliers if their subproblems are solved with 
local solvers. 


Numerical behavior 


Figure 2.9 shows the iterates for ADMM, ALADIN and the method of mul- 
tipliers from an initial point x? = (5, —0.5) and 4? = 0. One can observe 
the convergence to the infeasible point (2.72, 0.41)" for all these methods 
and similar to the previous chapter the "alternating behavior" of ALADIN and 
ADMM being projecting between the two constraint sets defined by g and 
Ax = b. 


Moreover, for SQP, after several iterations close to the infeasible point, SQP 
produces iterates far from any feasible point (indicated by the blue arrow in 
Figure 2.9). The reason for this is that Vg and A become very close to linearly 
dependent at (2.72, 0.41)" leading to intersection points of the spaces spanned 
by Vg and A very far away from the previous iterate. Globalization routines 


15 This effect is even strengthened if the local solvers are initialized at the previous iterate which 
is also often done in practice for computational efficiency reasons. 
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Method of multipliers SQP 


Figure 2.9: Iterates of ADMM, ALADIN, the method of multipliers, and SQP for problem (2.27). 


do also not help here. Most of the global convergence proofs for SQP require 
either an a priori assumption boundedness of the iterates (x^), {A*} [S0109], 
or they require linearly independent constraint linearizations [BT95, Thm 4.1], 
which are both violated in this case. Even if these assumptions hold, the 
convergence of SQP methods is typically guaranteed to a stationary point of 
some merit function such as £? [BT95, Thm 4.2]. However, these stationary 
points do not necessarily correspond to a local minimizer as we observed in 
our example. 


Note that this effect can not occur for convex problems. In this case, all sub- 
problems are also convex and solvers necessarily return global optima. In 
summary, we can conclude the following: Especially non-convex constraints 
are difficult to handle—not only for distributed but also for centralized algo- 
rithms. Thus, designing distributed algorithms with convergence guarantees 
from arbitrary initial points might be out of reach. However, in practical prob- 
lems we often have initial guesses close enough to local minimizers making 
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these algorithms nonetheless very successful in many domains. A similar class 
of counterexamples for interior-point methods can be found in [WBOO]. 
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In this chapter, we provide an overview on the state-of-the art of distributed 
optimization algorithms. Moreover, we investigate what makes these algo- 
rithms hard to apply in many practical applications. The amount of literature 
on distributed optimization grows very fast and this overview is far from being 
complete. However, we tried to identify main lines of research and aimed at 
identifying important works for each of these lines. 


We categorize the research on distributed optimization along three main lines 
of research: 


* primal-dual algorithms, 
* primal algorithms, 


* and internal decomposition algorithms. 


Whereas primal-dual methods are mainly based on Lagrangian relaxation (cf. 
Section 2.2), primal methods work purely in primal space, i.e. they do typi- 
cally not update any form of Lagrange multipliers. The third line of research 
considers internal decomposition methods. These methods decompose oper- 
ations of standard nonlinear programming algorithms such as solving a linear 
system of equations. We begin with primal-dual methods as one of the earliest 
approaches on distributed optimization. 


3.1 Primal-dual algorithms 


Early works on distributed optimization trace back to Everett Dantzig, Wolfe 
and Benders [Eve63; DW60; Ben62] in the 1960s. These first works mainly 
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| dual decomp. | distributed subgradient | EXTRA, NEXT, ... 
| Bender's/Dantzig-Wolfe decomp. | ADMM | distributed interior point 
1960 1980 2000 today 


Figure 3.1: Timeline of distributed optimization algorithms. 


considered Lagrangian relaxation for strictly convex problems and decompo- 
sition methods for linear programs. Later, the Lagrangian relaxation was com- 
bined with augmented Lagrangian techniques developed mainly by Hestenes, 
Powell and Miele [Pow69; Hes69; Mie+71; Mie+72] to improve numerical 
stability and to provide guarantees also for convex but not strictly convex 
problems. This led to first versions of ADMM with improved convergence 
guarantees and improved practical convergence [GM76; GM75; EB92]. Many 
of these duality-based works are summarized excellent textbooks by Bertsekas 
and Tsitsiklis [BT89] and Censior and Zenios [CZ97]. A timeline of distributed 
optimization approaches is shown in Figure 3.1. 


Renewed interest in the late 2000s 


Distributed optimization gained new interest in the late 2000s mainly in the 
field of machine learning and imaging science, where ADMM outperformed 
state-of-the art methods in certain applications [ABF10; SMG10; GM12]. 
Moreover, new applications in signal recovery emerged [CP07; CP11]. The 
main motivation here was computational speedup, i.e. to find methods for 
parallel computing.! Moreover, state-of-the-art methods have been shown to 
be a special case of ADMM [GO09; Gol+14] allowing their treatment in a 
unified framework. 


Duality-based optimization methods were used in communication networks 
[Chi+07] beginning already in the late 1990s [KMT98; LL99]. Similar ap- 
proaches were used in wireless sensor networks [MBG10; SRGO8], in signal 


! A nice example for this need is given in [BMNOI]. There, the authors consider a image- 
reconstruction problem from from positron emission tomography. Even an evaluation of one 
gradient of the objective took 15 to 45 min there, so there is no chance for using Newton-based 
schemes such as IP or SQP methods. 
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processing [ZGC09] and in certain fields of machine learning (support-vector 
machines) [FCG10]. A different version of dual decomposition is established 
in [NS08] called proximal center method. Herein—instead of minimizing the 
augmented Lagrangian in an alternating fashion to achieve separability—the 
author adds two linear proximal terms which are separable and lead to a dif- 
ferentiable dual function. An early dual-decomposition based approach for a 
multi-agent setting was presented in [TTM11]. The highly influential paper 
[Boy+11] showed that many of the above works can be treated in a unified 
framework based on ADMM. The importance of this framework lies in its 
generality—the convergence analysis of [Boy+11] carries over to the above 
applications. An even more general convergence analysis of decomposition 
methods (including ADMM) based on augmented Lagrangians is given in 
[ST14]. A dual method combining with smoothing and an excessive gap 
technique for convex problems is presented in [TDSD13]. The numerical re- 
sults in this work are promising and the method has a main advantage over 
other primal-dual methods: the parameters are updated automatically. Using 
an accelerated scheme in the dual step yields a convergence rate of O(1/k?) 
for general convex problems. A generalization is given in [TND16] includ- 
ing techniques from [Nes05a; Nes05b] allowing for inexact solutions of the 
subproblems and providing automatic parameter updates. 


The works starting with [WO12] consider a network setting, i.e. they introduce 
consensus constraints on characteristic matrices of the underlying communi- 
cation graph (incidence matrix, Laplace matrix, cf. Appendix A.2). [Shi+14] 
considers a problem formulation in form of (A.10). This leads to an edge- 
based formulation (ADMM maintains two Lagrange multipliers per edge). 
[MO17] develops a node-based formulation, where one Lagrange multiplier 
vector is maintained for each node in the network, where the network con- 
straint is encoded based on a Laplacian formulation similar to (A.12). Both 
cases lead to sublinear/linear convergence under convexity/strong convexity as- 
sumptions. The very recent work [Mao+20] presents a decentralized method 
for unconstrained non-convex consensus optimization with guarantees in à 
network setting. 
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Recent works considering non-convex objectives 


Recently, primal-dual algorithms were successfully extended to problems with 
non-convex objective function, but convex constraint set with guarantees. An 
ADMN-like algorithm, where the augmented Lagrangian is minimized in a 
block-coordinate descent scheme, is presented in [HJ14] based on the conver- 
gence analysis of [ABS13]. The analysis is based on new concepts beyond 
convexity such as the Kurdyka-Lojasiewicz (KL) property and Lipschitz/reg- 
ularity assumptions. The KL property is for example fulfilled, if the objec- 
tive and the constraint functions are-semi algebraic. This line of analysis is 
combined in [HJ16] with generalized equations from [ZA10] to a distributed 
control setting. The KL framework is developed further in [BST18; ST19a]. 
Approaches applying ADMM in the field of optimal control can be found in 
[KCD15; RCG17; BG19]. A dual second order method decomposing strictly 
convex QPs from optimal control along the time-axis is presented in [FSD 15]. 


The works [Cha+16b; Cha+16a] show that ADMM with standard consensus 
and strongly convex f;s converges linearly even under asynchronous operation 
with bounded communication delay. [CDZ15] and [CZ17] present an ADMM- 
flavoured method providing convergence guarantees also for non-convex f; 
under strong second-order sufficient conditions but with convex constraint 
sets. Moreover, [HLR16; LP15] analyze the convergence properties of ADMM 
for non-convex problems including certain classes of non-convex constraints. 
An more general convergence analysis is given in [WYZ19] including the 
[HLR16; LP15] as special cases. The line of analysis of [WYZ19] is extended 
in [GGC20] to problems, where the consensus constraint (2.12d) can be a 
multiaffine mapping. 


Remark 9 (Relation to operator splitting methods) There are excellent works 
generalizing primal-dual distributed algorithms in the framework of operator 
splitting and proximal methods [CP11; Sta+16; PB14]. This perspective paves 
the way for unified convergence analysis. Moreover, it leads to suggestions 
for effective preconditioning [GB15; GB17]. However, this point of view is 
beyond the scope of the present work as it requires a large amount additional 
mathematical prerequisites from operator splitting theory. We refer to [BC11] 
and the above works for further details. 


44 


3.1 Primal-dual algorithms 


Table 3.1: Typical properties of primal-dual decentralized algorithms. 


update idea convergence pros/cons 


x! € argmin, LP (x, AF)? maximize dual+ fe S: O(1/k)/ + decentralized 


AM! = ak + p(Ax — b) evaluate primal O(1/k?) + robust 
LP : augmented Lag. distributedly fe Su: O(T¥) + extension to certain 
p : stepsize non-convex settings 

- tuning 


- non-conv constraints? 


Pros and cons of primal-dual methods 


Primal-dual methods are very successfully applied in many contexts— especially 
for convex problems. For certain applications—e.g. in embedded optimal 

control, signal recovery and machine learning—they lead to promising perfor- 

mance [Ste+20; CP07]. However, most of the primal-dual approaches suffer 

from two drawbacks: 


* the constraint-set is often restricted to be convex; 


* and the convergence rate is typically linear/sublinear for strongly con- 
vex/convex problems. 


We refer to [DY16; GB16; PSB14] and [HY12] for a detailed convergence 
analysis. 


Recent overviews on decentralized optimization considering primal and dual 
approaches can be found in [NOR18; NOW18; Boy+11; Yan+19; HM11]. 
Examples on the application of ADMM to machine learning problems can 
be found in [Tay+16; Sul+19]. Recent surveys on optimization methods for 
machine learning including primal and dual methods can be found in [Ber15b; 
BCN18; SNWI2]. Overviews on splitting methods for control with emphasis 
on time-wise decomposition are given in [Sta+16; Fer+17]. 


Typical properties of primal-dual algorithms are summarized in Table 3.1. 


? By setting p = 0 one obtains dual decomposition. 
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3.2 Primal algorithms 


The second main line of reserach are primal first-order methods. These meth- 
ods are usually decentralized variants of the gradient/subgradient or proximal 
gradient method producing iterates purely in the primal space. Hence, they 
do usually not maintain any form of Lagrange multipliers. We give a brief 
overview on latest developments in a network setting. For a recent and compre- 
hensive analysis on their centralized counterparts we refer to [Bec17; Berl5a]. 


Classical approaches 


In contrast to dual methods, primal distributed methods are distributed in 
the design of the optimization algorithms. One of the most general settings 
is given in [TBA86], which is also one of the earliest and most influential 
papers on primal methods in distributed optimization. To illustrate the idea 
of primal methods and to be able to classify relevant literature, we give a 
simplified version of the iteration scheme typically used in primal methods 
(assuming deterministics, synchronization, no communication delays and a 
one-dimensional decision vector). This (simplified) iteration reads 


xttl = werk — Ak sk, (3.1) 


Here, W* is a so-called mixing matrix and s* encodes some form of gradien- 
t/subgradient or Newton step with respect to the individual objective function 
terms f;, and AK is a diagonal matrix providing positive stepsizes for each B 
The matrix W*x* can be seen as a kind of averaging step between neighbors 
driving the individual x; toward a consensus and the vector s^ contains some 
desirable direction of descent in the objective function driving the iteration 
towards optimality. The connectivity of the Network is here encoded in the 
matrix W^ (which might be time-varying). Convergence of these methods 
is guaranteed in a very general setting including asynchronous operation and 
communication delays [Ber83; TBA86]. An example for an early application 
of this framework to distributed reinforcement learning can be found in [Tsi94]. 


3 The matrix W* has to fulfill certain assumptions, e.g. it has to be stochastic or doubly stochastic 
meaning that all entries have to be non-negative and the column sums of W* have to be all one. 
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Consensus and first-generation algorithms 


In the 2000s, so-called consensus algorithms gained significant interest [JLM03; 
Boy+05; OFMO7; OT09; Blo+05].* Consensus algorithms compute (roughly 
speaking) weighted-averages between neighbored agents sequentially until a 
certain level of consensus (i.e. level of “equalness” in the primal variables) 
is reached. Thus, the iteration of consensus algorithms is x**! = Ax*. By 
using tools from classical discrete-time control theory, one can accelerate con- 
vergence of consensus algorithms by choosing A in a certain manner. From 
this perspective it seems natural that consensus algorithms are closely related 
to the before-mentioned primal algorithms. The work [BT07] showed that 
some of them are a special case of the earlier work [TBA86], where roughly 
speaking speaking the vector s^ in the primal optimization framework (3.1) 
is zero. However, these development renewed interest in decentralized primal 
algorithms and led to the development of decentralized subgradient meth- 
ods (DSG) for non-smooth f;s (e.g. for /1-regularized problems) [NO09]. 
Thereby the authors exploit that consensus problem can equivalently be for- 
mulated as a feasibility problem, i.e. one minimizes a zero function encoding 
consensus as a constraint which is mathematically described by a constraint 
set X = (x € R™ | x1 = x? =,...,= xg) [Ned+14]. The decentralized 
subgradient method was extended to a stochastic setting with subgradient- 
projection where all agents have knowledge over a global constraint set and 
thus also decentralized constrained and non-smooth optimization became pos- 
sible [SRNV10; Ned11]. 


In [DAW 12], DSG was extended to a decentralized proximal gradient method 
based on [Nes09; DAW 10] allowing for constraints in a deterministic, syn- 
chronous and undelayed setting. It also analyzes the influence of the network 
structure on the convergence. In the same setting, a decentralized primal-dual 
subgradient method was proposed in [ZM12]. All the above methods require a 
diminishing step site to converge to an optimal solution—one reason for making 


^ A consensus problem is a problem, where multiple entities have to agree on something. Exam- 
ples for this can be opinion dynamics, where after a while the population agrees on a certain 
political decision; sensor networks where multiple measurements have to be aggregated, i.e. 
there has to be a “consent” on the measured value; formation control, where autonomously 
flying objects have to agree on a certain formation, or self-organize biological systems, where 
swarms of animals (e.g. fishes) move in a certain direction. 
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them slow compared to other methods [NOR18]. To accelerate convergence, 
primal methods based on Nesterov's accelerated gradient ([Nes83]) for un- 
constrained problems over networks have been proposed in [CO12; JXM14]. 
They exhibit an improved sublinear convergence rate. The work [YLY16] 
analyzes the convergence of DSG schemes in case of a fixed stepsize. Therein 
authors show that in case of a fixed stepsize, the accuracy of DSG is limited 
by the stepsize, i.e. DSG converges to a more exact solution with a smaller 
stepsize. On the other hand, this slows down convergence and hence there is 
a tradeoff between convergence speed and achievable accuracy which can be 
seen as a drawback of DSG methods. This effect is called the exactness-speed 
dilemma. ^n extension of the analysis of DSG and prox-DSG methods to the 
non-convex regime is given in [ZY 18] showing that most properties from the 
convex setting are preserved. This work considers new analysis tools suitable 
for the non-convex case and shows convergence to stationary points (and to 
limit cycles) under almost the same conditions as in the convex case. An suffi- 
cient conditions for optimal convergence in time-varying networks applicable 
to the algorithms before is given in [Rog+19]. 


Second-generation algorithms 


A major drawback of the algorithms from before is the need for a diminishing 
stepsize to guarantee their converges making them typically slow. A new type 
of decentralized primal method called Exact First Order Algorithm (EXTRA) 
was developed in [Shi+15b] overcoming this difficulty. The main idea there 
is to introduce a correction term in the decentralized gradient schemes from 
before, compensating the non-stationarity of the f;s in the limit. The correction 
term is a sum of previous iterates—hence, one has to maintain one additional 
vector while iterating. The update formulas are 


xl = werk — AVE. (3.2a) 
y"! = we yk + V f(x) — VfR). (3.2b) 
Here, the first term in (3.2b), W*y*, is the newly introduced compensation 
term. By doing so one yields convergence also for a fixed step size overcom- 


ing the exactness-speed dilemma. EXTRA yields sublinear convergence for 
convex objective functions and linear convergence for strictly convex objec- 
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tive functions. An extension to constrained problems based on the proximal 
gradient method is given in [Shi+15a]. Also numerically, EXTRA has shown 
to outperform existing DSG schemes. A similar scheme (Aug-DGM) was 
proposed in [Xu+15] allowing for uncoordinated stepsizes. Algorithms very 
similar to EXTRA and Aug-DGM haven been proposed in [Ned+17; QL18]. 
In [Ned+17], the developed (DIGing) algorithm provides convergence guar- 
antees for strongly convex functions over time-varying graphs using new ar- 
guments from control theory (small-gain theorem). The authors of [QL18] 
prove sublinear convergence for convex functions. An extension of [Ned+17] 
to uncoordinated stepsizes is given in [NOS17]. A Nesterov-accelerated (Acc- 
DNGD) is given in [QL19] with significantly improved sublinear convergence 
rate. Asynchronous operation and time-varying directed graphs with linear 
convergence proof is considered in [Pu+20] with a similar approach given in 
[XK18]. [MJ18] presents and algorithm similar to DIGing with improved prac- 
tical convergence and a substantial reduction in communication. Non-convex 
objectives are considered in [TT17]. 


A different line of research in distributed primal methods was proposed in 
[Scu+14]. Therein, the main idea based on [RHL13] is to use a proximal- 
gradient like scheme, where non-convex parts of the objective are linearized 
and the evaluation of the proximal operator is done in a Jacobi or Gauss-Seidel 
like procedure. This leads to convergence to stationary points also in case of 
non-convex f with a convex constraint set. Note that this method also needs 
a diminishing stepsize. These ideas are combined with a dynamic consensus 
algorithm from [ZM10] leading to a very effective algorithm [DLS16] called 
NEXT with an empirically linear rate of convergence and convergence speed 
similar to ADMM. An extension to arbitrary and time-varying digraphs is 
given in [SSP16] (SONATA) containing NEXT, Aug-DGM and DIGing as 
special cases. 


Decentralized Newton-methods 


Another main line of research is primal decentralized optimization based 
on Newton-Methods. The article [JOZ09] considers a decentralized New- 
ton scheme very similar to the one we will use later for bi-level ALADIN. The 
consensus constraint is defined via a graph incidence matrix, i.e. each node 
maintains only one decision variable. The scheme is unconstrained. The New- 
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ton step is computed via a decentralized consensus scheme. Errors in the step 
computation are included in the convergence analysis. A similar approach is 
proposed in [TBJ19], where in contrast to the method before, the dual Hessian 
is converted into a system of symmetric diagonally dominant (SDD) matrices 
and this new (lifted) system is solved by a decentralized solver specifically de- 
signed for SDD systems. A combination of Newton based ideas with average 
consensus is given in [MLR17; Mok+16] which converge quite slow even for 
strongly convex problems. The approach in [Var+16] is at least able to gain a 
practically linear rate of convergence. 


Pros and cons of primal methods 


A main advantage of primal methods are that thy often come with advanced 
features such as asynchronous operation, considering time-delays or a changing 
network topology. However, their convergence is typically very slow although 
second-generation algorithms made significant progress towards acceleration 
of convergence. Moreover, having optimization problems from power systems 
or for distributed optimal control in mind, the current state-of-the art primal 
methods have the major drawback that they do often not consider constraints. If 
constraints are considered, they are typically assumed to be convex. Moreover, 
knowledge of the global constraint set by all agents is sometimes required. 


We refer to [NOW18; NOW18; Ned14; Yan+19] for excellent and more com- 
prehensive reviews of distributed primal methods. 


As constraints are of major importance in the applications from power systems 
and control we have in mind, we do not consider primal methods further in 
this thesis. 


1? The convergence of first-generation algorithm is sublinear mainly due to the required diminishing 
stepsize. If a constant stepsize is used, however, convergence might be faster but the “exactness” 
of the solution is limited in that case [NOR18; YLY16]. This effect is called the exactness-speed 
dilemma. 
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Table 3.2: Typical properties of first and second generation primal decentralized algorithms. 


update idea convergence pros/cons 
8 akt = wk yk _ AKSK consensus + fe S, Sp: + decentralized 
5 w^: mixing matrix gradient/ O(1/NK) / * very simple iteration 
5, Ak: stepsize Newton step O(1/k)!? + simple Subproblems 
rs E HAM * (changing topology) 
= s~ : step direction + (directed graphs) 


- convex X required 
- exactness-speed 


dilemma!’ 
- very slow 

s xo pk yk _ AK yk consensus + fe S:O(1/k)  - no exactness-speed 
E ed = V f (x!) m Vf(x*) aes fe Su : O(t*) pai 

o 

= k+l _ yk k k+l racking + decentralize 

B k " 2 TE . + simple iteration 
R W` : mixing matrix + simple subproblems 

AK. stepsizes + (changing topology) 


+ (directed graphs) 
- convex X required 
- slow 


3.3 Internal decomposition methods 


The third main line of research are internal decomposition methods. Internal 
decomposition aims at distributing certain steps from centralized nonlinear 
programming methods such as interior point methods or SQP. By doing so, 
the convergence guarantees of the “host algorithm” are inherited. In interior 
point and SQP, the most expensive centralized operation is solving a liner 
system of equations (KKT systems) in each iteration. Whereas for interior 
point methods decomposition of the KKT system is often straight-forward, for 
SQP methods the situation is a bit more difficult. Here, one has to consider an 
extra routine detecting the active set or, alternatively, use a method which can 
handle constraints as an inner algorithm. 


Classically, parallel linear algebra routines such as the Gauss-Seidel, the Jacobi 
iteration or the conjugate gradient method were employed for distributed com- 
putation of the KKT system [BT89, Chap 2], [Saa03, Chap 4]. However, these 
classical methods usually decompose along rows/columns of the KKT matrix 
aiming for load balancing and they are thus typically not directly applicable to 
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a multi-agent setting. Domain decomposition methods (or Schur-complement 
methods) [Saa03, Chap 14] are more suited for the multi-agent setting and we 
will use similar techniques in Chapter 4. 


Distributed interior-point and SQP methods 


Early works on distributed interior point methods for convex problems in form 
of (2.12) start with [HOSO1], where the KKT system is solved by specialized 
solvers for block-tridiagonal systems. A Schur-complement-based approach 
is presented in [GG07] and [GG09], where [GG09] efficiently exploits nested 
tree structures. The work [NS09] combines dual decomposition with ideas 
from interior point methods, where the KKT system is condensed based on a 
Schur-complement technique and subsequently solved by a centralized linear 
solver. A generalization thereof is presented in [Din+13] coming with a novel 
convergence analysis allowing for inexact solutions of the subproblems. The 
follow-up work [LT20] generalized this work for broader problem classes based 
on tools from [ST19b] and a polynomial-time complexity is established. A 
general structure-exploiting interior point method for problems in form of 
(2.12) is presented in [PHA14], where the KKT system is solved via ADMM. 


Examples for distributed implementations of interior point methods of time- 
dependent problems (e.g. from optimal control) can be found in [ZLB08; 
Kan+14; Wor+14; HSS17] In these works, decomposition is typically done 
over time, i.e. the problems are often in form of (2.12) but where R is a time in- 
dex rather than a set of subsystems. An augmented-Lagrangian based parallel 
interior point method for this setting is given in [CSL16]. One recent exam- 
ple of using Schur-complement techniques is implemented in the PARDISO 
solver [Sch+01] in combination with the very successful interior point method 
implementation IPOPT [WB06; Ver-17; KFS18; KKS20a; KKS20b]. The 
main focus of these implementations is the applications to time-dependent 
problems from power systems such as multi-stage optimal power flow or 
security-constrained optimal power flow—however the methods used there 
are in principle applicable to many problems with a similar sparsity structure. 
A similar approach with application to spatial decomposition in power systems 
is presented in [Lu+18]. 
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Table 3.3: Typical properties of internal decomposition methods. 


update idea convergence pros/cons 

-g* HE  AKT\ [Ax*\ exploit structure fe S >> + very fast 

b a AK 0 Aak in H, A (+ solve O(t C/D" )/ + non-convex constraints 
decentralized) -k ky - steps expensive 

ktl _ k, kak OKT) OT“) 

* W =N PU Ax : / - distributed 

AF = aK + ak AF - no advanced features 
HE : block-Hessian (changing topology etc.) 


AF : structured Jacobian - globalization 


For a multi-agent setting, an approach based on primal barrier functions is 
presented in [BPJ17] and a primal-dual based approach is presented in [BJ17]. 
These works guarantee convergence with decentralized barrier parameter up- 
dates. 


An early partially parallel SQP version is presented in [Lei+03]. [Sch15] 
presents a commercial distributed SQP method. 


Pros and cons of internal decomposition methods 


Internal decomposition methods are in general well-suited for constrained 
optimization. They typically consider some form of constraint-linearization 
in their step computation, thus the curvature information of the constraints is 
effectively exploited often leading to very fast convergence of these methods 
also in the constrained case. However, as all of these methods are second- 
order (i.e. they use Hessian and Jacobian information), step-computation 
is in general much more expensive. Using Schur-complement and reduced- 
space techniques can substantially reduce the dimension of the KKT system. 
However, solving the KKT system is in general highly centralized. Moreover, 
these methods are typically not tailored to a multi-agent setting. They are 
rather designed for parallel computing. 


5 The convergence rate depends on the precision of solving the KKT system and on the “exactness” 
of the Hessian H*. Exact Hessians with an exact solutions to the KKT system leads to quadratic 
convergence, approximations like BFGS lead to superlinear convergence, cf. Example 2. 
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Table 3.4: Overview on main lines of research for distributed optimization. 


primal-dual primal internal decomp. 
1% generation 2" generation 


examples dual decomp., consensus, EXTRA, distributed IP, 

ADMM, distributed NEXT, distributed SQP 
T" subgradient, ... DIGing, ... 

distribution via dualization, averaging averaging linear algebra, 
alternating direction dualization 

constraints via projection projection projection barrier/active set 

section Section 3.1 Section 3.2 Section 3.3 
overviews in [HMI1; Boy+11] [NOW18; NOW18] [Wor+14; ZLB08] 

[Sta+16; NOR18] [Ned14; Yan+19] 


[NOW 18; Yan+19] 


For our applications in power systems and control, techniques from internal 
decomposition seem to be excellent fits as we often have non-convex objectives 
and constraints. However, one challenge is to reduce the complexity for solving 
the KKT system. 


Mixed approaches 


Finally, there are a few algorithms which are hard to categorize along the above 
lines of research. One of them is an alternating trust region method [HJ17] 
which converges linearly to local minimizers. The second one is the Aug- 
mented Lagrangian Alternating Direction Inexact Newton (ALADIN) method 
[HFD16]. ALADIN can be viewed as a mix between primal-dual and SQP 
methods combining distributed optimization with the fast (superlinear or even 
quadratic) convergence properties of SQP. In contrast to many other methods 
from before, these two methods are able to handle non-convex constraints. A 
decomposition scheme for time-wise decomposition of optimal control prob- 
lems based on ALADIN is presented in [Kou+16]. 
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3.4 Comparison of algorithms 


Let us assess the before mentioned algorithm classes in view of desired prop- 
erties for distributed optimization. Classical primal-dual algorithms such as 
ADMM and dual decomposition are well investigated and work robustly for a 
wide range of practical (convex) problems. Moreover, convergence guarantees 
have recently been extended to special classes of non-convex problems. For 
separable problem structures, they are able to operate in a decentralized fash- 
ion although centralized parameter tuning might be required. Their amount of 
communication is usually very small and recently also advanced features such 
as operating on directed graphs or asynchronous operation have been investi- 
gated. However, in case of constrained problems, certain undesired alternating 
projection effects might occur (cf. Section 2.3). 


Primal algorithms do very well in decentralized operation as they do not assume 
a special structure of the problem and the communication graph can be chosen 
independently from the problem formulation. Moreover, similar to primal- 
dual methods, the communication overhead per iteration is quite small and 
many of these algorithms are able to operate on directly graphs with changing 
topology and asynchronously. On the other hand, at least first-generation 
primal algorithms are very slow. The convergence rate of second-generation 
algorithms can be similar to primal-dual methods. A drawback of primal 
algorithms is that considering constraints (even convex ones) is often not 
possible. 


Internal decomposition methods are often not decentralized as they often aim 
for parallel computing and computational speedup than for a multi-agent set- 
ting. As they often solve a reduced KKT system, the communication overhead 
is quite large and the coordination problem might be expensive to solve due 
to additional centralized operations such as the update of barrier parameter in 
interior point methods. However, very fast convergence is often guaranteed 
also for problems with non-convex objective and constraints. 


Lastly, ALADIN is a mixture between primal-dual methods and internal de- 
composition methods inheriting the fast convergence properties of internal 
decomposition methods and the convergence guarantees for constrained non- 
convex problems. Moreover—in contrast to internal decomposition methods— 
ALADIN offers of distribution by detecting the active constraints locally and 
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Table 3.5: Existing works in view of desired properties for distributed algorithms. 


primal-dual primal internal dec. ALADIN 


decentralization + ++ -- = 
convergence guarantees conv. conv.  non-conv  non-conv 
convergence speed o (--), 0 ++ +++ 
communication ++ ++ --- >s 
broad applicability ++ -- + + 
constraint-handling + - ++ ++ 
advanced features® + ++ -- sie 


reducing the coordination step to a linear system of equations. However, com- 
pared to primal-dual and primal methods, its coordination and communication 
effort is quite large. The properties of primal-dual, primal and internal decom- 
position methods in view of the requirements for distributed optimization are 
summarized in Table 3.5. Therein (+) and (++) indicate that the considered 
algorithm fulfills the listed property and (-) and (- -) indicate that it does not. 


Prospect for application in power systems and control 


The present thesis has mainly applications from power systems and optimal 
control in mind. Characteristic in these two classes of problems are non- 
convex constraints which essentially leave two classes of distributed algorithms 
for further investigation: primal-dual methods and internal decomposition 
methods. As ALADIN is a mix between these two it seems to be an excellent 
fit. However, as indicated above, ALADIN has a quite heavy coordination step 
and requires a quite large amount of central communication. Motivated by 
this fact, a main focus of the present thesis will be to develop a decentralized 
ALADIN-based algorithm preserving its favorable convergence guarantees and 
reducing its communication requirements. In the next chapter, we present a 
new algorithmic framework fulfilling these requirements which we call bi-level 
ALADIN. 


With advanced features we mean features such as asynchronous operation, the ability to handle 
communication delays, operation over directed graphs or a changing graph topology. 
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In the previous sections we highlighted that the core advantages of ALADIN 
are its fast local convergence guarantees, and an efficient handling of non- 
convex constraints. Disadvantages are a large amount of communication and 
a missing decentralization. This chapter proposes a framework in which both 
is possible—operation based on neighbor-to-neighbor communication (de- 
centralized optimization) and reducing the communication overhead while 
preserving ALADIN’s fast local convergence guarantees. 


By having a closer look at Algorithm 5, one can observe that only the coordi- 
nation step 2) requires central coordination and global communication.! Thus, 
the idea of this chapter is decentralize this step by means of decentralized inner 
algorithms. 


Large parts of this section are based on [Eng+20; Eng+19b]. The sparsity 
encoding techniques and the decentralized conjugate gradient algorithm are 
improved versions from [Eng+20] and not published so far. The convergence 
analysis in Section 4.5 is a unified version of [Eng+20], which does not require 
any results from [HFD16] and is purely based on nonlinear programming 
theory from Chapter 2. Similar approaches can also be found in the literature 
on numerical methods for optimal control [Kou+16; FSD15]. 


4.4 Reducing communication by condensing 


In a first step, we apply so-called reduced-space or condensing techniques to 
the coordination QP of ALADIN. By doing so, we reduce the dimension of the 


! A globalization also requires either central coordination or costly additional communication. 
However, as stated before, here we focus on a /ocal algorithm. Designing distributed globaliza- 
tion routines is still an open problem. 
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QP and thus ALADIN's communication overhead. The basic ideas for doing 
so are from the literature for nonlinear programming [NWO6]. 


Recall the coordination QP of ALADIN including slack variables (2.25) 


1 
min X. SAT Bi Ax + Vf (xp )Ax + AP d ALIE (4.1a) 
ie 


Ax,d 

subjectto CřAx;=0 VIER, (4.1b) 
2 AGE + Axi) -b =d  |a®. (4.1c) 
ie 


The first technique we apply is the so-called nullspace-method [NW06, Ch. 
12] eliminating the constraint linearizations Qr € Rm" from (4.1b). To 
this end, we need a basis of the nullspace of C j3 Let us construct a matrix 
zk € R^sXQsi-nh) with CES = 0 for all 7 € R forming a basis of this 
nullspace with its columns.? This allows parameterizing Ax; — ZF^v; and 
thus 


l. x 
min Y Av] BK Av;  gP Av; + AF d + Zld] (4.22) 
Av,d £4 2 2 
ie 
subject to» Auk AkAy;-b=d  |a®. (4.2b) 


ie 


In case of many active constraints, also the reduced Hessian Be = ZO BM 
the reduced gradient g = QS and the reduce coupling matrices AF = 
AiZF are much smaller in dimension compared with (4.1). We also reduce 
dimension of the decision vector Ax; from nyi to ny; — ur. Hence, we reduce 
communication when solving (4.2) instead of (4.1) in step 2) of ALADIN. A 


detailed communication analysis will be given in Section 4.6. 


2 If rows of C k are linearly dependent though, the number of rows of Zk is nxi — nk, + dk, 
where d. k is the number of linearly dependent rows of C K x 


58 


4.] Reducing communication by condensing 


The Schur complement 


Employing the Schur-complement allows to further reduce the dimensionality 
of the QP (4.2). For doing so, we need form the inverse of Be. Hence, we 
assume the following. 


Assumption I The reduced Hessian approximation Bk is symmetric and posi- 
tive definite for all iterates k € N and all i € R. 


This assumption holds for example if Vx.£(x*, A^, p^) is positive definite on 
the range space spanned by V/A(x*) for all k € N. However, it also holds by 
choosing an appropriate Hessian approximation B such that the assumption 
holds. We will further comment on how to do so in Chapter 6. Moreover, we 
assume the following. 


Assumption 2 The matrix A = [Aı,..., Ar] has full row rank. 


As Bk > 0, the KKT conditions (2.2) for problem (4.2) are necessary and 
sufficient for a global minimizer to (4.2). The Lagrangian to (4.2) is 


d. i 
LP (Ay, d, P) = V7 SAV BiAvieg Avi + A d ALIE 
ie 


HART Auf + AFAvi b - d]. 
ic 


Thus, the KTT conditions V.£9P (Av, d, AQP) = 0 read 


B' Av + g* + AP = 0, (4.32) 
AK + pd - A9? z 0, (4.3b) 
A*x* + AFAy b d =0, (4.3c) 


where B* = diag,.g(B*), g^ = (oF pees " A = (A,,...,Ar) and 
Av= (Av, Er Au)"; Here, diag;. 4 (Hi) denotes diagonal concatenation of 
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matrices H; indexed by the set &. By eliminating (4.3b) with s = z(a% -A*), 
we obtain 


B' Av 4 g^ + AF A® =0, (4.4a) 
8 sy 1 
A935 + A = b- (P Lay sup. (4.4b) 
p 
In order to further reduce dimensionality of (4.4), we rearrange (4.42) as 
Av = BE (gk AE 49), (4.5) 
where we used Assumption 1. Inserting to (4.4b) yields 
— 1 . -k =k- 1, 
Ak BEV ART 4 Lg aŒ = akxk — AFB* pk b ak. (4.6) 
p 


—— p 
Sk 


IS 


Exploiting block structure 


Because of the block structure of B* and A in (4.6), one can write 


k _ gk pk! akr LN gk gk! akt _ k 
st = AK Re A Don A! 2^ (4.7) 
tE 1€ 


: lk nk Tk 
with S% = AF Bk A, Moreover, 


s* = Ax* — ARBRE" g^ 23 - Ā; BK pk a> ar (4.8) 
ie ie 


with s% := ARE -= A,BF ' gk., Observe that the variables S% and s% can be 
computed entirely locally the subsystems i € R. Hence, it suffices that the 
subsystems communicate these two quantities to a coordinator, which then 
solves (4.6). 


Note that (4.6) is a linear system of equations of dimension of the number 
of consensus constraints ne, which is typically much smaller than the total 
number of variables n + ng + nc. If one solves (4.6) for AP, all ru and Ax; 
can be computed by back substitution in (4.5) and by AX? = Zi k Ayk, 
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Sy $5 
0 . .... 0 
5 . .... 5 
ol HE | 0 
15 15 
0 5 10 15 0 


Figure 4.1: Sparsity patterns of (5; };egı,2,3, for the 5-bus system (Figure 5.2b). 


In summary, this subsection showed that instead of solving (4.1) one can solve 
(4.6) in the coordination step 2) of ALADIN (Algorithm 5). This reduces 
communication and coordination drastically in case of problems with many 
active constraints. In this case, it suffices to communicate sensitivities of 
reduced dimension (Schur complements) S 1 and vectors s$ instead of the full 
sensitivities. However, note that although this version of ALADIN requires 
less communication, ALADIN still requires central coordination. In the next 
subsection we derive an algorithm solving (4.6) in a decentralized fashion 
rendering ALADIN a decentralized optimization algorithm. 


4.2 Decentralization of coordination 


Next we develop two algorithms solving the reduced coordination step (4.6) in 
a decentralized fashion, which means purely based on neighbor-to-neighbor 
communication. For doing so, we first analyze the sparsity pattern of S;. 


The sparsity of S; 


Although the matrix S is generally dense, the matrices S; are not necessarily. 
Figure 4.1 shows the sparsity patterns from an optimization problem from 
power systems which we will use in Chapter 5. Here, certain rows/columns 
seem to be structurally zero. This is the case since often the rows of A; 
express coupling between variables in two different subsystems. The zero 
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rows/columns in 5$; come from the zero rows in A; of all remaining subsystems. 
In the following we call the two subsystems i, j € R, which share a common 
non-zero row in their A; assigned to this row. 


Definition 6 A subsystem i € R is called assigned to consensus constraint 
ceC=1{l,...,n.} if the cth row of A; is non-zero. 


Moreover, we collect all subsystems assigned to consensus constraint c € C in 
(c) := {i € R | i assigned to c € C}, 
and we collect all consensus constraints assigned to a subsystem i € R in 
Cl) := {c EC |i e R(c)}. 


With that, we are ready to make a statement on the sparsity of S;. 


Lemma I The rows and columns of S; and the entries of s; with c ¢ C(i) are 
Zero. 


Proof. By Definition 6, all rows of AF - AiZF with c ¢ C(i) are zero. It 
follows immediately that the rows and columns of hys - AF B» AKT and the 
entries of sf = AE : m A;BF gk with c ¢ C(i) are zero. m 


4.2.1 Exploiting sparsity for decentralization 


The idea now is to exploit this sparsity for decentralization. Our goal is to 
formulate (4.6) as 


» si E (4.9) 
VER ief 


where 8 and ae are formed entirely locally and preserve the sparsity of Sk 
and s% . Two quantities hindering us are zl and (oak — b) in (4.6) which do 
not “belong” to any of the subsystems. 
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The main idea in the following is, to "distribute" al and ( a — b) to subprob- 
lems, which are assigned to the corresponding consensus constraint. To this 
end, we define 


Sk tow (©) , ak k V tea) (c) x 
and 5; :=s; + (A -be)ee. (4.10) 
>) Role“ 2 RIP 


where Ie is a zero matrix except for [/].. = 1, tc(-) is the indicator function 
of the set C, and ec € R” is the cth unit vector. This way, the sparsity patterns 
of S k and sk are preserved and (4.6) is reformulated in form of (4.9). 


This sparsity pattern can in turn be exploited by ADMM leading to a decen- 
tralized ADMM. To this end, we express (4.9) as a strongly convex QP. 


Lemma 2 The QP 
1 R R 
: IT 6.3 <T7 
min „A 2,54- D5 (4.11) 
is strongly convex, it's minimizer is unique and solves (4.9). 


Proof. First, we prove strong convexity of (4.11), ie. $ := x Š; > 0. 
By Assumption 1 we have Bk > 0 and hence B* > 0. Thus, B* > 0. 
Furthermore, by Z* having full column rank, A* = AZ* has full row rank by 
Assumption 2. Hence, y := A*x # 0 for allx # Oand thus Sk = A* BE ‘ART > 
0 by y BE" y > 0. By the definition of $, (4.10), we have $ = S + al > 0 by 
S>0. 


Let us show that $; = ST. We have ST = (A* BE s Ak (BE ')T ART = 
A* (BF )*! A*T. By Assumption 1, we have BK = B* and hence the assertion 
follows. 


It remains to show that the minimizer of (4.11) is unique and coincides with 
the solution to (4.9). Uniqueness follows from strong convexity. Again from 
strong convexity, the first-order necessary conditions are also sufficient for a 
minimum. They read I(XR, Ši + Ne STJA - bm 5; = O and thus coincide 
with (4.9) by symmetry of S. o 
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Ica) les) 
Or Ice) 0 » 
E 0 . s 
5 we “eo, 5 EA 
o 5 ra a 
10 . "e 10 E 
s 0 5 10 15 z 
0 5 10 15 0 5 10 15 


Figure 4.2: Sparsity patterns for Uc het 23y 


4.2.2 Decentralized ADMM 


To derive a decentralized variant of ADMM (d-ADMM ), we restate (4.11) 
in so-called consensus form (A.9). We introduce matrices /c(;) € RIC()Ixne 
mapping from global Lagrange multipliers A € R” to local ones 4; € RICCI 
by A; := Ic(i)4. These matrices are defined as the horizontal concatenation of 
unit vectors e. € R"« indexed by the set R as Ig := [23 T e RIRIX"c Alter- 
natively, these matrices may be viewed as identity matrices, where certain rows 
have been “deleted” eliminating zero rows/columns in $; and s;. Figure 4.2 
shows exemplary sparsity patterns of {/¢(i)}icr for an example from power 
systems from Section B.4. The matrices /c(;; have an interesting property, 


namely that S; and 5; are invariant under multiplication with 77 Cli plea y 


Lemma 3 (Invariance of Ši and 5; under multiplication with I Io Tow) It 
E = 5; 
holds that Zo) Toy Si = §; and 151. ca) = Si. 


Proof. By definition, Ica ) is composed of unit vectors for which e; e gm 1, if 
and only if i = j and e? e; = = Oelse. Thus, Ioa Mea j) is an identity matrix with 
Zero rows/columns for ail rows/columns indexed by c ¢ C(i). As these rows 


are anyways zero in $; and 5; by Lemma 1, the assertion follows. Oo 
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Algorithm 6. d-ADMM for problem (4.12) 


Initialization: aD, y? for all i € R, p 
Repeat for all i € R: 


1 Gk P. z 
1) antl = (s! + pl) (s -y" + pit) (local) 
2) ie =(A;)7! Djen(i) nat (neighbor-to-neighbor) 
3) yn*l = yt + p(t) = Tous anb (local) 


With that, we are able to reformulate (4.11) in consensus form. Con- 
sider f;(A) := iA Sià — STA as local objectives. By Lemma 3, f(A) = 
fica cq) = Fet. Thus, problem (4.11) can be written as 


ii _ > Alad 
oF RA TER (4.12) 


subject to A; = Ic(i)À | yi forall i eR, 


with Lagrange multipliers y; € RIC!. 


Solution via ADMM 


Let us evaluate the ADMM iterations (Algorithm 3) for problem (4.12). The 
Augmented Lagrangian for (4.12) is 


LP (As... AR A, yp... YR) = 


= p = 
D AoA +y Qi = Lew) + FIIs - Teoh. 
VER 


(4.13) 


Necessary conditions for minimizing (4.13) with respect to Aı,..., Ar are 
Skat! — gk + y? + p(a7* — a") 20, foralli e 8, 


where SE := I cos teu and s.: = = Icast. Rearranging terms yields step 1) 


of Algorithm $ 
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Step 2) of Algorithm 3 minimizes (4.13) with respect to A. The first-order 
optimality conditions are 


227169) -PaA - Loma") = 0. (4.14) 
ie 


Step 3) of Algorithm 3 is 
y m en "A +p(am! _ Iowa"). (4.15) 


Summing up for all i € R yields 
2 ye 2 + pa = Ica’). 


iER iER 


Multiplying by I, 7, from the left together with (4.14) implies that X jeg T} Cli) ys 
O after the first ADMM i iteration. Hence (4.14) becomes 


T 1 = 1- 
2 Ror- Iole" =0. (4.16) 
iER 


Multiplying by 7 from the left yields 


cu) 


#1 _ +1 _ 
MAT Loa ai - COPA cat 70. (4.17) 
i¢R—__} — 


a ae E o 
—lj 
=A 


. x, 2 i à 
Since loge = J and AB = BA for diagonal matrices A, B, we have 


Ar, AU = Aj p 


qnl T qnl 
AA T Ig M CG) lec) 


Tov) loj) C) CC) Tew) 


where we define A; := oa ei ME Hence, (4.17) can be written as 


Are) a, 


ie 
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Ti» PE 
0r 0 
. 
. 
4 ° 4 
e 
e 
e 
8 8 . a 
e 
. 
12 12 . 
0 4 8 0 4 8 12 
Figure 4.3: Coupling matrices /;? and 1,3 corresponding to the matrices [celieu 2,3} shown 


in Figure 4.2. 


As Ij; = 0 for j € N(i), we can replace i € R by i € N(j), where we define 
the neighborhood N(i) of subsystem i € R by 


N(i) := {j € R | there exists ac € C with i, j € R(c)}. 


The identity /;; = 0 for j € N (i) follows from this definition and the definition 


of Io Ga) This yields step 2) of Algorithm 6. 


Note that in Algorithm 6 we have entirely local steps (step 1) and step 3)) 
and one neighbor-to-neighbor step (step 2)). This means that the algorithm 
is decentralized, i.e. all communication is done locally between neighbors 
and a central entity is not required. Superscripts (-)" denote d-ADMM it- 
erates. Quantities with superscripts (-)* do not change during the d-ADMM 
iterations—they refer to (outer) ALADIN iterations. Furthermore, note that the 
initial guesses for A have to be consistent, i.e. they have to satisfy a = Ici) 49 
for all i € R. 


Observe that the matrices Jj; € RICGIXICGDI! encode “coupling information" 
between the subsystems. More specifically, they define which entries of the 
vectors A;, and in subsystem i € R belong which entry of the vectors A; in 
subsystem j € R. Furthermore, observe that /;; = 7. The sparsity patterns of 
I1? and 1,3 corresponding to /c(1). /c(2). /c(3) from Figure 4.2 are illustrated 
in Figure 4.3. One can observe that subsystem 1 only requires the first four 
entries of 4» from subsystem 2. 


d-ADMM is guaranteed to converge to the minimizer of (4.12) and the mini- 
mizer of (4.12) corresponds to the solution of (4.9) by Lemma 2. Hence, we 
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Figure 4.4: Communication graph for our power systems example. 


achieved our goal of deriving a decentralized version in the coordination step 
of ALADIN. 


Remark 10 (Global variable consensus vs. general consensus) An alternative 
way of reformulating (4.11) would be via the (simpler) so-called global variable 
consensus [Boy+11] 


Biss: TER (4.18) 
subject to A; = À | yi, for allie R. 


as we did in [Eng+20]. However, this leads to the situation that f = Vice fi 
loses it's strict convexity property with respect to the inflated variable block 
AT = (A[,.... Ag) if not all f; depend on all components of A; as we have 
here because of the sparsity of $;. This is not the case in (4.12) since we 
eliminate locally redundant variables by means of the matrices 7c(;). This is 
important since convexity vs. strict convexity can lead to sublinear vs. linear 
convergence for ADMM [MO17; HY12; GB16; HL17; Shi+14]. 


Remark 11 (Communication graph) One can regard the set of subsystems R as 
nodes and the set & := {(i, 7) € RXR | there exists ac € C with i, j E€ R(c)} 
as edges in a communication graph G = (N,&). The edges of this graph 
can equivalently be characterized by all (i, j) for which /;; + 0. For the 
example from power systems from Section B.4, the graph is fully connected 
and displayed in Figure 4.4. 
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4.2.3 Decentralized conjugate gradient 


Using d-ADMM as an inner coordination algorithm has the advantage that only 
very few information has to be exchanged between neighboring subsystems, 
namely the average primal values 4. However, ADMM requires tuning which 
might be difficult (cf. Section 5.5) and the convergence rate is at most linear. 


To address these drawbacks, the idea of this section is to develop a decentral- 
ized variant of the conjugate gradient algorithm (d-CG) with fast convergence. 
A main advantage of CG is, its convergence guarantee in af most n. steps to the 
exact solution of a linear system of equations [NW06, Ch. 5.1]. Although this 
property is weakened in practice due to numerical round-off errors, the prac- 
tically observed convergence rate is superlinear, which is very fast compared 
with other methods such as ADMM [BKO1]. 


The idea of CG is to iteratively minimize (4.11) by generating iterates {a} 
which are S-conjugate to each other, ie. A"'$4" = 0 for all n + m. Š- 
conjugacy can be seen as a generalization of orthogonality. Performing a 
one-dimensional minimizations along these conjugate directions {A"} yields 
the conjugate gradient algorithm [NW06, Ch. 5.1]. The iterations are given 
by 


nTn 
a" = p r ; (4.192) 
p'' Sp” 
qn = qn + wp” (4.19b) 
pul, a" Sp", (4.19c) 
nlT .n+l 
gu e (4.19d) 
pnt pn 
p = pri + B"p*, (4.19e) 


where the initialization has satisfy r° = p? = 5- $49. Now, let us again use the 
sparsity results from Lemma 1, Lemma 3 and the following additional result. 


Lemma 4 It holds that Y;c« I vU Iggy =. 


Proof. Since A is diagonal, and by definition of / ca) as the concatena- 
1 - =]. — T = a 
tion of unit vectors, we have A; = ( Ici Mo a» E Pu Toy: 
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Combination mu Lemma 3 and AB = BA for diagonal matrices yields 
= T -1 7T = 
Vier Moi loq 7 Lier Towtew’ leq) Ie) ANE 
zip. T —- ll =I us 
Lier Bulow luo Tow A = Lier legle 5AA =L S 


The main idea in decentralized conjugate gradient is to exploit sparsity in 
(4.19a)-(4.19e) by the mappings Io i) which we used already for d- ADMM. Let 
us start by defining—similar to d- ADMM—local equivalents for all involved 
variables 4, r and p 


Apistd apa; i= Low? and  pj:- Ic(P- (4.20) 


c(i) 


bo these definitions, we decompose (4.19a). For doing so, we define a” = 
T. and 7” :— r"' r^. The idea is to write this identity in terms of r” instead 


ofr”. By Lenis 4, we write 7] as 


nT nN  ,nT n _ p n 
m rer IT x gar is) T^. 
ie 


Defining n; := r"" A; !r" and by using (4.20) we obtain 


an ud c nTA-l,n _ : 
n-r r =)? A; r= yon, 


TER TER 


where 7; can be computed locally. However, note that the sum Ji jeg 7; requires 
global communication of one scalar n; per subsystem. This yields step 3) (a) 
and step 4) of Algorithm 7. Next, we decompose the denominator c" := 
p" Sp” in (4.192). By the definition of $, p; and by Lemma 3, the denominator 
can be written as 


n. p" > Šip" = p >> Duos ap" 23 p? sip” EX of 
ief ief IER ief 


This yields step 1) and step 5) (c) of Algorithm 7. Let us consider (4.19b) and 
(4.19e). They are entirely local steps since multiplying both equations by 7 


: C) 
from the left yields 


3jn-l jn j <=. Iı yt! k 
A eg + a pn and pU pr T Trp! 
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Algorithm 7. d-CG for problem (4.12) 
Initialization: 2° and r = p? = 3- $00 
Repeat for all i € R: 


1) a = Tego? (scalar global sum) 
2) pay = T Ljenci) liju (neighbor-to-neighbor) 
3) (a) thar arin! (p) ant = An 4 Zp; (local) 
4) Atl = Dier qe (scalar global sum) 


n+l »" » 
5) (a) ppt! = rpp? (p) u? = $jp?*1. (c) oft! = p™ 17S; p?*! (local) 


comprising step 5) (a) and step 3) (b) of Algorithm (7). In a last step, we need to 
decompose (4.19c) which requires neighbor-to-neighbor communication due 
to the product Sp” as we shall see. Again, by Lemma 3 and by the definitions 
of p; and Sj, we have 


i m" m" nt 

nel ..m- HT GP uH o o "um We NE. M T Q yn 

r =F HNP =r = S;p"=r = 25 
JER JER 


Multiplying by 7/..,., from the left yields 


C) 


n n 
pe x ph T >) lysp? =r? - E NETS 
jeN(i) jeN(i) 
where we recall that /;; := Ioue and uj = Sip". Note that we only 
sum over the neighbors of subsystem i as /;; = 0 if j # N(i). Moreover, the 
quantities wj -$ j p can again be computed locally by each subsystem. This 
yields step 2) and step 5) (b) of Algorithm 7. 


Remark 12 (Decentralized operation via hopping) Note that although for com- 
puting 7 and o global communication is required, this communication may 
also be executed in decentralized fashion via “hopping”. By that we mean that 
each subsystem adds it's 7; to the sum which was received from a neighbor 
until each subsystem was reached. By a second “hopping” round, the sum 
can then be “broadcasted” to all subsystems. In general, such an approach is 
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somewhat problematic as it requires all subsystems to wait until one hopping 
round has to be performed, which can potentially take a long time in case of 
bad local communication links as these delays sum up. However, as in d-CG 
we only have one scalar per subsystem independently of the problem size, this 
approach might be justified. 


4.2.4 Consistent initialization 


Note that d-CG requires a consistent initialization satisfying r? = p? = $— $49. 
Multiplying this equation by Loy from the left and using Toto yi = Si 
yields 
0 0 x 6.30 c 6 0 
ri = Pi = 2: Loy Si = 2: Toy Sia = ` Iij$j = > 1,8745: (4.21) 


JER JER JER JER 


Thus, when using a zero initialization A? = 0, one needs one extra neighbor-to- 
neighbor step for initialization of d-CG. When using warm starting—i.e. using 
the solution of the previous outer bi-level ALADIN iteration as an initial guess 
a6 + 0—one needs one additional neighbor-to-neighbor step for computing 
the second term in (4.21). Note that the vectors 2° are all known locally from 
the previous outer bi-level ALADIN iteration. For d-ADMM such a step is 
not required as 5; is explicitly considered in step 1) of Algorithm 6. For warm 
starting, one can use A directly from the previous iteration. 


4.3 Comparing d-ADMM and d-CG 


Next, we briefly compare important properties of d-ADMM and d-CG. 


4.3.1 Information exchange 


The required information exchange and the locally maintained variables for 
d-ADMM (Algorithm 6) and d-CG (Algorithm 7) are graphically illustrated in 
Figure 4.5. Both algorithms require the same amount of information exchange 
between neighbors: whereas d-ADMM exchanges the matrix-vector product 
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I31À3 I31ua 
(As, ya) 
ZZ (Aa, pa. T3, Us) 
Ar) (A1,P1, T1, 1) 
(2,72) (A2, p2, T2, U2) 


Figure 4.5: Information exchange and local variables for d-ADMM and d-CG. 


Table 4.1: Properties of d-CG and d-ADMM for problem (4.12). 


convergence rate d-CG d-ADMM 


theoretical n.-step linear/sublinear 
practical superlinear linear/sublinear 
tuning no yes 
local variables four two 


IijAi, d-CG exchanges the matrix-vector product /;;u;. A key difference in 
d-CG is the additionally required global communication of the two scalars 7 
and o per iteration. Moreover, for d-CG, one has to maintain four variables 
per node, whereas d-ADMM requires to maintain two variables per node. 


4.3.2 Convergence properties 


d-ADMM and d-CG exhibit different convergence properties. The conjugate 
gradient algorithm (theoretically) yields the exact solution in at most n steps 
[NWO6] which is a very strong guarantee. However, in practice convergence is 
typically slower as conjugate gradient is quite sensitive to numerical round-off 
errors coming from finite precision arithmetic. Nonetheless one can observe 
superlinear convergence [BK01]. The convergence rate of ADMM is either 
sublinear for convex problems or linear for strongly convex problems, cf. 
Chapter 3. A further advantage of d-CG compared with d-ADMM is that no 
tuning of the step size is needed as this is done “automatically” in step 1) and 
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step 4) of Algorithm 7. The properties of d-CG and d-ADMM are summarized 
in Table 4.1. 


Remark 13 (Different version of d-ADMM in [Eng+20]) In [Eng+20] we 
used a different version of d-ADMM. A major difference is that in the present 
version, redundant variables in (4.12) are eliminated and thus all f; are strongly 
convex. Hence, in the present case, we obtain linear convergence (cf. [Yan+16; 
WO13; NOR18]) Moreover, note that the derivation of D-ADM and d-CG in 
[Eng+20] is different as we do not use the matrices 7;; there for encoding sparse 
communication. 


4.4 Bi-level distributed ALADIN in summary 


The overall bi-level distributed ALADIN algorithm is summarized in Algo- 
rithm 8. Observe the similarity to basic ALADIN in Algorithm 5. There are 
two main differences between these two algorithms: First, instead of computing 
full sensitivities like gradients, Hessians, Jacobians in step 1) of Algorithm 5, 
one computes Schur complements $; and 5; by (4.10) reducing the amount of 
reuquired communication compared to Algorithm 5. Moreover, in step 2), the 
condensed coordination step is computed in a decentralized fashion leading to 
an overall decentralized algorithm. The residual IIr* || quantifies the “amount of 
inexactness" in the coordination step due to decentralized conjugate gradients 
and decentralized ADMM and will be defined in the next subsection. 


4.5 Convergence analysis 


Recall that d-CG and d-ADMM solve the coordination QP (4.1) with a finite 
precision only. Thus, the convergence analysis from [HFD16] can not directly 
be used for bi-level ALADIN. In the following we show, that the favorable 
convergence properties of ALADIN can be preserved if the error in the in- 
exact solution to (4.1) and the error due to Hessian approximation decays 
fast enough. For doing so, we use techniques from inexact Newton methods 
[DES82; HFD16]. Our analysis contains the analysis of basic ALADIN as a 
special case. The overall proof is outlined in Figure 4.6. The main idea is to 
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Algorithm 8. Bi-level distributed ALADIN 


Initialization: Initial guess (29, 2°), parameters X; > 0, v, p > 0. 
Repeat: 


1) Solve in parallel for all i € R 


2 


k 
Xi- zi 


v 
e =argmin fj(xj) + (AF) Aix; += | 
x; €; 2 


(4.22) 


and compute condensed sensitivities S; and 5;. 


2) Coordination Step: Solve (4.9) centrally or decentrally with d-CG or d-ADMM 
with residuum Irk || small enough such that (4.33) holds for ARH, 


3) Set akt! — AK and ze — ar + Ark. 


combine the property that the local problems in step 1) of ALADIN form a 
Lipschitz continuous map in terms of (z*, A*) and step 2) can be viewed as an 
SQP step making results from Newton-type methods for fast local contraction 
applicable (cf. Section 2.1.2). 


Lipschitzness of the parallel step 


Let us start by analyzing the Lipschitz property of the parallel step 1) in basic 
ALADIN (Algorithm 5). This step is important mainly because of two reasons: 
It minimizes the local NLPs while at the same time staying close to the previous 
iterate z^, where the distance to z^ can be controlled by v. The hope is that the 
minimizer of the NLP brings us closer to a local minimizer (although this is 
theoretically not proven so far, but often it does in practice). The second and 
maybe even more important feature of this step is that it determines an active 
set A(x*) for which the constraint linearizations of the inequality constraints 
are held constant in the coordination QP. This renders the coordination QP an 
equality constrained QP instead of a mixed equality-inequality constrained QP, 
which is substantially cheaper to solve. 


Next, we use standard results from parametric programming to show that the 
mapping formed by step 1) of basic ALADIN is Lipschitz continuous with 
respect to (z*, A*). We will then use this result to show show overall fast linear 
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“Lipschitz” 
local step (Theorem 2) coord. step 
ne zn gh 
aa [t ak — | zk 


AF 


quadratic contraction 
(Theorem 3) 


(a) Variables in local/coordination step. 


local step is Lipschitz contraction properties 
w.r.t. (z^, A*) of Newton-type methods 
(Theorem 2) (Theorem 3) 


ideas from inexact 
q " d Newton methods 
covergence of 


bi-level ALADIN 
(Theorem 4) 


(b) Proof dependencies. 


Figure 4.6: Outline of the bi-level ALADIN proof. 


to quadratic local convergence of bi-level ALADIN. The theorem is originally 
from [Rob80, Thm 4.1], but here stated in form of [FI90, Thm 5.2]. 


Theorem 5 (Lipschitz property for parametric programming) Consider the 
parametric programming problem 


min f(x, p) subjectto g(x,p) =0 and A(x, p) x 0. (4.23) 


xeR" 


Let LICQ (Definition 2) and the conditions of Theorem 2 (SOSC) and hold at 
(x*, p) with u* > 0 (strict complementarity), where x* is the solution to (4.23) 
for a given p = p. Then there exists a y < co such that in a neighborhood of 
p it holds true that 


lx — x*ll < xllp — all. (4.24) 


By defining p" := (A*', zk) it is obvious to see that the local problems in 
step 1) in Algorithm 5 are in form of (4.23). Thus, if we are able to ensure 
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that LICQ, SOSC and strict complementarity hold at E we can use (4.24) in 
our local convergence analysis. Hence, we have to ensure that the Hessian of 
the Lagrangians sufficiently positive definite to ensure that SOSC holds. This 
can be ensured by choosing v and 2; large enough/sufficiently positive definite 
such that locally 


vi. ( fix) + ek hi(xf)) LO (4.25) 


is satisfied. This corresponds to the statement in [HFD16, Lemm 3]. 


Coordination is a Newton-type step 


Next, we analyze the progress towards a local minimizer in the coordination 
QP (4.1). We evaluate the first-order KKT conditions (2.2) for the coordination 
QP which is a linear system of equations. Then we compare this linear system 
to the KKT system from an Newton-type step applied to the KKT conditions of 
problem (2.12) to highlight their similarity. This will make convergence results 
from Newton-type methods (Theorem 4) applicable yielding local quadratic 
convergence of ALADIN. 


Let us start by evaluating the KKT conditions for the coordination QP (4.1). 
Note that as Bk > 0 on the nullspace of CF, the coordination QP (4.1) is 
strongly convex and thus the KKT conditions are necessary and sufficient for 
solving (2.25) (cf. Section 2.1.1). The Lagrangian to (4.1) is 


1 
L(Ax, s, RA) = Ax  B'Ax e V f) Ace AF s + si 
KT CF Ax + AT (4 (x + Ax) -b-s), 


where &' = (KT, ..., KR) are the Lagrange multipliers associated with (4.1b). 
Here we use concatenated notation for all variables, i.e. Ax’ = [Ax], S Axp) ‘ 
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A = (A1,..., AR), B* := diag BR), = diag;eg(C*) and Vf (3) = 
(V. fi (x) ^... V fa (x5) ). Thus, the KKT conditions read 

VaxL£(Ax, s, R, A) = B'Ax- Vf(x) -ATA- C" & || 20, — (426a) 

V,£(Ax, s,&,a) 2 A + ps-A -0, (426b) 

VeL£ (Ax, s, K, A) = CF Ax -0,  (426c) 

ViL(Ax,s,&,a) =A (s " Ax) zer -0.  (4264d) 


Rearranging (4.26b) yields s — HU — A*). Eliminating s in (4.26d), defining 
Ad := A — AR, and expanding (4.26a) with X C*" x and A" AK reveals that 


VaxL£(Ax, s, K, 4) = B'Ax + Vf (x ) + AAA ATA +CT Ak CF =0, 
V £(Ax, s, Ř, A) = C Ax = 0, 
VaL (Ax, s, £2) = Axe x) zb cpi - 0. 


with Ax = K — x. Thus, 


Be Ay CE VAS -V f(x) ATA SCIT e 
A = 11 0 ||A4|- -Axk+b (4.28) 
ck o0 0 JAAk 0 
ee ————M— EEEEEEEEE SE 
=M((xT, KTN") =:m( (x7, KT, A731 


The dependence of M on x stems from the dependence of V2, £ (x, A, K) on x 
(in case of exact Hessians). 


Similarity to an SQP step for problem (2.12) 


The Lagrangian of the affinely-coupled separable problem (2.12) reads 


L(x, KE, K1, A) = f(x) + Kg g(x) + Kj h(x) - AT (Ax — b), 


78 


4.5 Convergence analysis 


where f = Vier fis Kg = (Kg... Kgg) Kp = (qp. KigRhb 8 = 
(Bars h' = (hj,...,hg) and A = (Aı,...,Ar). Then, the KKT 
conditions read 


FQ) HVLC Kk, K7, A") 
Vi) + Vg(x*)' kx + Vh(x*) T AT a* 


u Ax* —b er (4.29) 
g(x*) 
(h(x*)) ar) 
where q' := (x',«',A') with x’ := (kp, xj). An exact Newton step applied 


to (4.29), VF(q*)Aq = F(q*), yields 


VL Ke a*) AT WAT (AC) aus Mf Ax 


A 0 0 0 AA 
Vg(x^) 0 0 0 AKE 
VU) ass 0 0 0 AK] 
AED) ace) (4.30) 
-V f (xk) - Vg(x*)" E - Vh(x*)' kk - ATA* 
_ —Ax* +b 
-g(x*) 
m (hx) a 


By comparing (4.30) with (4.28) one can see that for exact Hessians B* = v2 L 
and Jacobians CKT = (Vg(x*)" VAH) aan)» both equations coincide 
apart from the term —1/. For p — co both equations coincide exactly. Hence, 
the coordination step of ALADIN can be interpreted as an inexact Newton 
step to (2.12). Note that Theorem 4 allows for such an “inexactness” in the 
compatibility condition (2.10a), but the convergence rate depends on whether 
this inexactness vanishes asymptotically. 


Overall convergence under inexact coordination 


Now let us use Theorem 4 to prove local convergence of ALADIN. For doing 
so, we have to check whether the conditions of Theorem 4 are satisfied. The 


79 


4 Bi-level Distributed ALADIN 


Lipschitz condition (2.10b) holds by assuming Lipschitzness of the gradients 
and Hessians/Jacobians in f, g and h in (2.12) and by assuming that LICQ and 
SOSC are satisfied (to exclude the case that the KKT matrix is rank-deficient 
at q*). For the compatibility condition (2.102), let us assume that we use exact 
Hessians B* = V2 . £ (x*, «*, A*). Then, the compatibility condition reads (by 
using the left-hand sides of (4.30) and (4.28)) 


y: [a coh)! rt) - VEG) = ar coh» 7 pue - 1]. 


Since M*(p*) > VF(q*) for p* > oo, law * Go) converges to a fixed 
number and 1/p* converges to zero which means that we can make ^ arbitrar- 
ily small (and especially smaller than one) as required by Theorem 4. Thus, 
the conditions of Theorem 4 are satisfied? and we can use the contraction 
inequality (2.9) in our following convergence analysis. 


Apart from the inexactness introduced by the term 1I from before, we have 
a second kind of inexactness: the inexactness coming from an inexact step 
computation from the before-mentioned two inner algorithms d-ADMM and 
d-CG. Thus, we include this second source of inexactness in our convergence 
analysis. For doing so, we use ideas from inexact Newton methods [DES82]. 
Let us denote the primal-dual iterate after the coordination step 3) of basic 
ALADIN (Algorithm 5) with q" = (z^, &', A") and let us denote an inexact 
primal-dual iterate to step 3) by q. 


We analyze the effect of the inexactness. The distance of the inexact iterate 
g**! to a local primal-dual solution q* reads 


jg -q* | < jg a q^? + ||g*! = g*|| < 


lat! - epe p^ - pi? rot rl. 430 


where we expanded with +q**!, used the triangular inequality and the Newton- 
type contraction from Theorem 4, where we denote the primal-dual iterate after 
step 1) of basic ALADIN with p! = (x', x", A"). In order to preserve conver- 


gence, we need to bound |g! — g**! | in (4.31) representing the “inexactness” 


1 nl 3 9 ET 
3 We consider local convergence here. Thus, the condition ||g* — q*|| < XU has to be 
satisfied by a suitable primal-dual initialization. 
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of the solution to (4.1). In order to do so, let us define the residual to the KKT 
system (4.28) as 


rg = M(p^) (4! - pt) - m(p*, a). (4.32) 
Since M(p^) (q*! — p**!) - m(p*, A*) = 0 by definition of g**!, we have 
at -ql = (M (o9) (r$ + mG*, 2%) - m(p*, 2) || 
< IM) lrg: 
Combination with (4.31) and defining £ := ||M(p*)~!|| yields 


kel _ 


g - a* || < BlIr I + 5 lie - p* |? +y||p* - p*]|. 


By assuming LICQ, strict complemntarity and SOSC we know that by Theo- 
rem 5 that the mapping formed by the local problems in ALADIN Lipschitz 
with constant y. Thus, we get 


la^ - q* || < (ew - »*) x la^ - 4*|| ^ ax lla* - a*l . 


Enforcing a bound on the inaccuracy in the solution to the coordination QP 
stemming from inner algorithms 


lr < n lm, AI (4.33) 


yields 


le" -a*s (Bonk ev) x^ - ale rat. 434 


where 6 is a Lipschitz constant of m. 


Note that if the sequences (7^) and (y^) are bounded by constants 77, y small 
enough such that ||g* — q*|| < 2(1 — y Bon* — x y )(« x”)! is satisfied in 
each iteration, linear convergence of ALADIN is guaranteed. By additionally 
enforcing bounds on the sequences (7^) and (y^), i.e. n* = O(||g^ — q* ||) and 
y* = O(||g* — q*||) we can make the convergence quadratic. Alternatively, 
enforcing n*, y* — 0 yields superlinear convergence. 


4 Bi-level Distributed ALADIN 


We summarize our results from this section in the following Theorem. 


Theorem 6 (Convergence of bi-level distributed ALADIN) 

Consider bi-level distributed ALADIN (Algorithm 8) with 2; and v such that 
(4.25) holds. Suppose Assumption 2 and the conditions of Theorem 2 hold. 
Letthe solution to the condensed QP (4.9) satisfy (4.33) for a bounded sequence 
(n ken < n and let q be close enough to a primal-dual minimizer q*. 


Then bi-level distributed ALADIN converges to p* 


* at linear rate if 7 and y are small enough; and 


e at quadratic rate if n% = O(||g^ — q*||) and y% = O(||g* - g*]]). 


Theorem 6 follows directly from inequality (4.34) and the definitions of linear 
and quadratic convergence (cf. Section A.1). 


In basic ALADIN, the coordination problem (4.9) is solved exactly and thus 
n* = 0. Thus, basic ALADIN can be seen as a special case of bi-level ALADIN. 
Moreover, the convergence analysis given here can be extended to the case 
where the the local NLPs are solved with a finite precision only [Eng+19b] 
using similar techniques as we did here. We omit this extension as it adds more 
technical complications without adding much insight. Moreover, a BFGS- 
ALADIN variant we presented in [Eng+19b] is included in this framework as 
a particular choice of BY We do not consider this variant here. 


Remark 14 (KKT matrix modifications) Note that the compatibility condition 
(2.10a) and the corresponding requirements on {y*} cover a variety of modifi- 
cations of the KKT matrix M—not only the modifications with respect to zi . 
The error due to Hessian approximations like BFGS or the error due to reg- 
ularization procedures for ensuring positive definiteness of Bk can be further 
sources of inexactness, or errors because of approximations in the constraint 
linearizations C*. 


Remark 15 (Evaluation of the error condition (4.33)) In order to guarantee 
local convergence one has to ensure that (4.33) holds for a suitable sequence 
(n^) < n. Intuitively this means, that the residual IIrk || has to become smaller 
as ALADIN proceeds and thus the accuracy in the coordination step has 
to get higher as we get closer to a local minimizer as m(p^, A*) — 0 for 
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q% — q*. But evaluating TDI at each iteration is expensive and yields 
additional overhead, which one would like to avoid. There is an option to 
overcome this limitation: instead of evaluating lir one can equivalently 
evaluate the residual in the reduced system (4.9), P. This can be seen as 
follows. As we use the nullspace method, the last row of m in (4.28) is always 
zero by C kzk = 0. Hence, it remains to evaluate the first and second row 
of (4.28). Again, by using the nullspace-parametrization Ax = Z*Av from 
Section 4.1, one can enforce a zero residual in the first row (by multiplying 
with Z*" from the left and exploiting C*Z* = 0) 


BK Av + g* + A* 31 2 0, 


where we recall that A**! denotes the inexact version of A**!. Inserting into 
the second row yields the residual 


i. (A*(B*)- A LDAM AF (BE-1g — Ave + ak +b NJ 


With that, we have ri = (07 07 jy and thus TDI = ||r^||. Hence, we can 
use r$ for evaluating (4.33) instead of B 


4.6 Comparison of variants 


Next, we compare the two developed bi-level variants with basic ALADIN, 
ALADIN with condensing and ADMM as one of the most prominent primal- 
dual methods. We denote bi-level ALADIN with ADMM with b-ADMM 
in the following, and bi-level ALADIN with conjugate gradients with b-CG. 
Our evaluation mainly focuses on communication effort, coordination effort, 
convergence guarantees and practical convergence. 


Communication 


As reducing communication was one of our main motivations for developing 
bi-level ALADIN, let us start by analyzing the amount of communication. 
We analyze communication by counting the number of floating point numbers 
(floats) which have to be exchanged per iteration. 
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We begin by analyzing basic ALADIN. Here we have to communicate all com- 
ponents necessary to construct problem (4.1), i.e. the Hessian approximations 
Bj, the gradients of the local objective functions V he ), the Jacobians p and 
the primal variables x;. Note that we do not count the communication of A; and 
b since they stay constant during the iterations and have to be communicated 
only once. The Hessian approximation Bk is a symmetric matrix, requiring to 
communicate n; (n; + 1)/2 floats. The Jacobian C s calls for communicating 
(Ngi tni) nx; floats, where nk, is the number of active inequality constraints in 
outer iteration k. Furthermore, the primal iterates x; and the gradients V f; Gas ) 
lead to n,; additional floats from each subsystem. In total, this means that we 
get a forward commincation for basic ALADIN of 


k 
> Nyi (Nyt + 1)/2 + nyj(ngi + nj;) + 2nyi 
ies 


floats per ALADIN iteration if we do not exploit structural zeros in any of 
the before mentioned vectors/matrices. In the backward step 3) of ALADIN 
(Algorithm 5), ze and A"! have to be communicated leading to (jeg ny; + 
n.) floats in backward communication. 


For the condensed ALADIN variant, all subsystems have to communicate 
their reduced Schur-complement matrices S$; € RICMIXICOl! and vectors 3; € 
RICO! As §; is symmetric, this leads to a forward communication of 


2 Ico (Ca) 1/2» IcCG)I. 


ie 


In the backward step only A‘*! has to be communicated leading to ne floats in 
backward communication. 


For b-ADMM (Algorithm 6), the matrix vector product J; jan has to be 
communicated in each inner iteration. As the total number of non-zero entries 
in /;; is equal to the number of consensus constraints assigned to subsystem 
i € R, we have |C(i)| floats for for each inner ADMM iteration. Thus, we 
need to communicate >) ;<g |C (7)| in each d-ADMM iteration on a neighbor-to- 
neighbor basis. This yields a total forward communication of n^P cg |C(i)| 
floats per outer ALADIN iteration, where n^P is the number of d-ADMM 
iterations. As all A7 are known locally already during the d-ADMM iterations, 
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Table 4.2: Per-step forward communication (number of floats) for different bi-level ALADIN 
variants and ADMM. 


ALADIN variant global local backward 


i+1 E 
basic > fixi (= tno nk, + 2) - X lxi t hc 
icf icR 
C()|(1CG)] +3 
condensed S ICOIC®I +3) - ne 
4 2 
iER 
bi-level (CG) 2n°SR ne ca) : 
icf 
bi-level (ADMM) g wey Vem g 
TER 
pure ADMM - Feen - 
iER 


no backward communication is needed. Furthermore, for d-ADMM there is 
no need for global communication. 


Similar to d-ADMM, in d-CG the matrix-vector products /;;u ; have to be com- 
municated leading to local communication of n9 Seg |C(i)| floats, where 
nCG is the number of inner d-CG iterations. Additionaly, d-CG needs the 
global communication of two floats per subsystem and per inner d-CG itera- 
tion leading to 2Rn“ floats. Also for d-CG no backward communication is 
required as 4; is already known locally. 


For ADMM directly applied to (2.12), there are variants similar to Algorithm 4 
requiring the exchange of the coupling variables only [Boy+11]. This leads to 
a communication overhead of >) ;<@ |C (i)| floats per ADMM iteration. From 
this, at first glance it seems that ADMM outperforms the ALADIN variants 
by far in terms of communication. However, keep in mind that due to its weak 
convergence rate guarantees, ADMM might need much more iterations in total 
to achieve the same accuacy compared in ALADIN—so for evaluating the total 
amount of communication one has to multiply these numbers by the numbers 
of iterations needed. We will investigate this effect numerically in Section 5.6. 
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Table 4.3: Bi-level ALADIN compared to distributed optimization algorithms. 


primal-dual primal internal dec. ALADIN bi-level ALADIN 


decentralization + ++ -- - + (++) 
convergence guarantees conv conv non-conv  non-conv non-conv 
convergence speed o (--), 0 ++ +++ ++ (+++) 
communication ++ ++ -- ee + 
broad applicability ++ -- + + + 
advanced features + ++ -- t 


Convergence properties and rates 


As shown in Theorem 6, bi-level has similar (local) convergence properties 
than basic ALADIN with one difference: the convergence rate here depends on 
the accuracy in the coordination step and on the quality of the Hessian approx- 
imations. If the accuracy increases and the approximation error decreases fast 
enough in terms of the distance to a local minimizer, quadratic convergence 
rate can be preserved. 


The properties of bi-level ALADIN compared with primal, primal-dual and 
internal decomposition methods are summarized in Table 4.3. Here, one can 
see that with bi-level ALADIN we overcome the two main limitations of basic 
ALADIN: 


* the required amount of communication, 


* and the lack of decentralization. 


At the same time, the favorable convergence properties of ALADIN are to a 
large extend preserved. 


It remains to show how these methods perform in practice. We will do so in 
the next section with main focus on problems from power systems. A further 
example from distributed optimal control is given in Section 6.6. 
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4.7 Summary and conclusion 


In this section we presented a framework called bi-level ALADIN, which is one 
of the first decentralized optimization algorithms for non-convex optimization 
with fast local convergence guarantees. In bi-level ALADIN, we use methods 
from nonlinear programming such as the nullspace method and the Schur- 
complement to substantially reduce the dimension of the coordination QP of 
basic ALADIN. Moreover, we showed how to exploit problem structure to solve 
the coordination problem in a decentralized fashion making bi-level ALADIN a 
decentralized algorithm. We presented two algorithms for doing so—a variant 
of ADMM derived similar to other variants in the literature and a decentralized 
variant of the conjugate gradient method, which is to the best of our knowledge 
novel. The versions of d-ADMM and d-CG given here are improved versions 
of [Eng+20], where ADMM is guaranteed to convergence linearly due to a 
strongly convex formulation of the coordination QP. Moreover, the here used 
variant of the conjugate gradient method omits the precomputing step from 
[Eng+20] leading to a substantial improvement in local communication for 
d-CG. Additionally, we showed that the fast local convergence properties of 
ALADIN are preserved in bi-level ALADIN, if the error in the coordination 
step decays fast enough. 
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This chapter evaluates the performance of ALADIN, bi-level ALADIN and 
ADMM on Optimal Power Flow (OPF) problems—an important problem class 
from power systems. We start with main motivations for using distributed 
optimization methods for OPF. 


We consider the AC power system model from [GS94; Sch17; FR16; FGM18]. 
The problem formulation for distributed OPF is from [Eng+19b; Eng+17]. 
The numerical results of basic ALADIN and ADMM are similar to [Eng+19b; 
Eng+20]. The analysis of the role of a feasible initial point in Section 5.4.3 
is from [EF18]. The analysis of bi-level ALADIN for OPF in Section 5.5 is 
mainly from [Eng+20]. Parts of the convergence analysis of d-ADMM and the 
numerical communication analysis are new in this thesis. 


5.1 How to operate power systems? 


The main purpose of power power systems is to deliver power from generators 
to power consumers. Thereby, the consumer’s demand should be satisfied at 
any time and to an arbitrary amount. In a classical setting, there are no storage 
elements. This implies that the generated power has to match the power 
demand exactly in each step in time. Although power demands can typically 
not be influenced, there is a certain degree of freedom in power generation. 
Namely, one can decide which generator takes which share of total active and 
reactive power demand satisfaction. This can be exploited for a cost-effective 
operation. 


Engineering constraints also have to be considered. Transmission lines for 
example have thermal limits allowing them to carry a limited current only. 
Consumer devices and grid components are only guaranteed to work properly 
in specified ranges around the nominal voltage. Having this in mind, one can 
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ask the following important question: how to choose the generator inputs such 
that 


* all power demands are satisfied; 
* all technical limitations are met; 


* and in a cost-effective fashion? 


One of the most promising ways of approaching this question is to encode 
the above question in an optimization problem called the Optimal Power Flow 
problem. Optimal power flow computes cost-optimal generator set points 
while satisfying power demands and technical limitations. We give a brief 
introduction to OPF in Appendix B. 


New challenges in today's power system 


For formulating an OPF problem, a grid model is required. Grid models, 
however, contain sensitive data. This includes the model data itself but also 
consumption data potentially revealing sensitive information about consumer 
behavior. Moreover, power system are often required to be (N — 1)-secure, 
that is, a single point of failure is not acceptable due to data aggregation and 
the existence of a single point of failure. Moreover, in many countries, there 
are many grid operators which are cooperatively responsible for operating the 
grid in an economically optimal and safe fashion. In Germany for example, 
there are four transmission system operators and more than 800 distribution 
system operators [Bun19], which have to coordinate in some way. The map of 
the German distribution grid operators, Figure 5.1, graphically illustrates the 
dimension of this problem. Here, the question arises how to coordinate these 
grid operators. 


Distributed and especially decentralized optimization algorithms seems to be 
an excellent fit for addressing the above challenges. Distributed optimization 
offers systematic coordination while reducing complexity by shifting com- 
putation overhead to the subsystems (grid operators). At the same time the 
information exchange and centralized coordination are limited. 
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€—X 


Figure 5.1: Distribution grid operators in Germany [ene20]. 
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5.2 Distributed optimal power flow 


In order to apply distributed and decentralized optimization algorithms such 
as ADMM and (bi-level) ALADIN, we have to express the OPF problem 
mathematically in structured form (1.2), 


min SY) fila) subject to x; € X;, IER and 2 Aged. (5.1) 
x iER iER 


Here, the set R represents the set of regions or system operators, f; encodes 
the cost of power generation in each region, and X; captures the grid model 
and technical limitations in each region. 


Due to renewable generation, today electrical grids are often operated close to 
their capacity limits. Thus, one has to consider grid models capturing effects 
which occur in this case such as the effect of reactive power and voltage limits. 
The most common model considering these effects is the so-called AC model, 
which we recall in Appendix B. Therein, we also show how to formulate 
the AC model in form of (5.1). Solving (5.1) via distributed or decentralized 
algorithms is called distributed optimal power flow. 


5.3 Literature review 


Next, we provide an overview on main lines of research for distributed OPF. In 
Chapter 3 we have seen, that most distributed and decentralized optimization 
algorithms are designed for convex problems. Moreover, even if they can 
handle non-convexity, this is typically in the objective function but not in the 
constraints. Unfortunately, in AC optimal power flow problems we have non- 
convex constraints (cf. Appendix B). To cope with this issue, one can identify 
four main lines of research in the literature: 


* Early works propose internal decomposition methods mainly focusing 
on decomposing the KKT system in Newton's method or in interior point 
methods [Com92]. 


* The second line of research staring in the 90s of the last century applies 
primal-dual convex optimization algorithms to OPF [Ers14a; Bal+99; 
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HPK02; SPGI3]. Although they work well in many cases, convergence 
can not be guaranteed and their convergence rate is often slow as we will 
see later in this section. 


A third line of research simplifies the problem constraints obtaining 
convex problems for which primal-dual distributed algorithms exhibit 
convergence guarantees. Approximations like DC optimal power flow or 
semidefinite relaxations belong to this class [Low14a; DZG13a; CA98]. 


As indicated in Chapter 3, few distributed algorithms with convergence 
guarantees for non-convex problems exist and are applied to OPF—two 
of them are ALADIN and bi-level ALADIN [Eng+19b; Eng+20], and a 
third one is an alternating trust-region method [HJ17]. 


Internal decomposition methods 


Let us start with internal decomposition methods as they were among the 
first decomposition methods applied to power flow and also to optimal power 
flow problems. Note that as outline in section Chapter 3, these methods were 
mainly designed for computational speedup coming from limited computation 
power at that time. One of the first works [FP78] considers a combination 
of clustering techniques and block elimination in the KKT system. This 
line of research was extensively investigated in the 1980s, e.g. for power 
system state estimation [WHB81] even in asynchronous schemes [TPG83]. 
Advanced structure-exploiting factorization techniques are used in [VNM83; 
BA86; SAY87; ETA90; OKS90; SVN93], where most of these works are 
summarized in the technical report [Com92]. Some of these block-elimination 
techniques have a similar flavor as the reduction methods we used in Section 4.1 
for bi-level ALADIN, especially the works [FP78; SAY87]. Although these 
methods are transferable to OPF, in their original form they typically address 
power flow or state-estimation problems. More recent examples for internal 
decomposition in interior point methods can be found in [Yan+11] and [LT18], 
where in [LT18] the authors yield very promising numerical results also for 
large-scale systems. Note that all of the before-mentioned methods still require 
central coordination for solving the reduced KKT system and also for other 
operations such as barrier-parameter updates. Thus, these methods offer only 
a limited degree of distribution. 


93 


5 Application to Power Systems 


A second branch works on decomposing the KKT conditions instead of the 
decomposing the KKT system. The main idea is to hold all variables which are 
not involved in the corresponding subsystem fixed and to solve the resulting 
simplified subproblems independently from each other. The method doing so is 
called Optimality Condition Decomposition (OCD), first proposed in [CNPO2; 
Con+06] and further investigated for OPF in [NPC03; HA09]. In [CNP02], 
the convergence rate of OCD is shown to be linear under certain technical 
conditions. One weakness of the above methods is—although they might work 
well in certain practical problems—that the convergence of these methods 
seems not to be fully analyzed at present. In [CNPO2], a sufficient condition 
for convergence of OCD to a first-order stationary point is developed. However, 
to the best of our knowledge, it remains unclear whether this condition holds for 
arbitrary OPF problems [Ers14a]. A similar approach based on decomposing 
the KKT conditions without tuning parameters can be found in [BB06]. 


Remark 16 (Relation to classical Gauss-Seidel methods) Note that also the 
classical Gauss-Seidel method for power flow problems [HO93; Glo! 1] can be 
considered as a nodally “distributed algorithm”. Therein, each node computes 
individual voltage updates based on the current iterates of neighbored nodes. 
However, as the non-convexity of the power flow equations suggests, also 
the classical literture indicates that the Gauss-Seidel method often diverges 
[Gloll, Chap. 6.5]. 


Primal-dual algorithms 


We continue with (convex) primal-dual optimization algorithms based on aug- 
mented Lagragians directly applied to the non-convex AC OPF. Early primal- 
dual methods use dual decomposition [AQC99], ADMM [WSLOI], and re- 
lated augmented Lagrangian methods such as the auxiliary problem principle 
(APP) [Bal+99; HPK02; AKA07] for solving the OPF problem in a distributed 
fashion. APP is evaluated against other augmented Lagrangian methods like 
ADMM and a Predictor-Corrector Proximal Multiplier Method in [KB00]—all 
of them showing a comparable performance. This line of research continues 
until today—especially ADMM [Ers14b], with recently refined parameter- 
update schemes in [Ers15]. Although ADMM converges to modest accuracy 
for medium-sized grids (similar to many other augmented Lagrangian meth- 
ods), recent attempts on large-scale grids indicate that convergence becomes 
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increasingly difficult for with increasng problems size and might work only 
with special partitioning techniques [GHT17]. Moreover, all of these methods 
are in general not guaranteed to converge and divergence seems to occur in 
practice [Chr+17a]. If they converge, they achieve at maximum a linear con- 
vergence rate. But the convergence modulus can be quite large as we shall see 
later. Moreover, if an increasing sequence of the penalty parameter in combi- 
nation with a feasible initialization is used (as done for example in [GHT17]), 
one can show that these methods converge to a feasible but not necessarily 
optimal point (cf. Section 5.4.3). One strength of augmented-Lagrangian and 
similar methods is that they seem to converge quite stable in terms of converg- 
ing to modest accuracy for OPF—at least up to medium-sized grids. Recently, 
an augmented-Lagrangian based scheme for radial grids with guarantees for 
the AC model is presented in [Liu+17]. 


Convex relaxations and approximations 


A different approach uses convex relaxations and approximations of the power 
flow equations yielding convex problems for which the before presented algo- 
rithms are—in contrast to the full AC-OPF problem—guaranteed to converge. 
A classical approach for doing so is based on the DC power flow approximation 
[GS94] to which augmented Lagrangian-based approaches have been applied 
in [LHA94; CA98; BB03; Fei+15]. The advantage of this approach is that 
classical augmented Lagrangian methods converge with guarantees and with 
modest communication requirements for the relaxed problems. However, im- 
portant physical quantities such as voltage magnitudes and the reactive power 
are not considered in the DC model making it non-applicable for many prac- 
tical applications. Especially for distribution grids—which is one of the most 
important field of application for modern OPF methods—the accuracy of the 
DC model can be poor [BZ16; Chr+17a]. Recent approaches combining the 
DC model with modern distributed algorithms such as ALADIN can be found 
in [Jia+20; Jia+19]. Another approach for convex modeling of radial grids is 
the so-called LinDistFlow model [BW89] which is based on the assumption of 
a flat voltage profile. The resulting convex OPF problem is solved via ADMM 
in [SBC14]. However, in view of the large voltage deviations in distribution 
grids cause by renewable power injection, this assumptions hardly holds in 
many practical situations. 
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Table 5.1: Distributed optimization for optimal power flow—state-of-the-art. 


assumptions guarantees pros/cons representative works 
ALADIN good initializa- yes + performance [Eng+17; Eng+19b; 
tion + accuracy MSL19], 
- communication | (ALADIN-DC: [Jia+20; 
E - coordination Jia+19]) 

* ADMM convexity, (feasi- no + coordination [Ers14a; Ers15; KB00; 
= ble initialization, + communication Bal+99; DZGl13a; 
large p) - performance eI 

- accuracy 
ADMM-DC DC power flow yes + simple LHA94; CA98; BB03; 
- no voltages/ Fei+15] 
reactive power 
ADMM-SDP radial grid, (yes) - alg. guarantees [Bai+08; 112: 
phase shifters + elegant theory Mol+13; Low |4a; 
- assumptions DZG13a; Low14b; 
- problem inflation PL17; Chr+17a] 
= OCD technical condi- unclear + decentralization [NPCO3; HA09; 
s tion + communication HKW15] 
2 - scalability, guar- 
antees? 
internal de- - from + convergence of [FP78; TPG83; VNM83; 
composition “host” host algorithm BA86; SAY87; ETA90; 
algorithm + scalability OKS90; SVN93; 
- coordination Com92], [LT18; KFS18; 
- decentralization Yan+11] 
alternating yes + high accuracy [HJ17] 
trust region - tested for small 
grids only 
“augmented convexity no + coordination [Bal+99; KB00; HPK02] 
Lagrangain- + communication 
like” algo- - performance 
rithms (APP, - accuracy 
PCPM) 
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Instead of the DC approximation, a related line of research employs convex 
relaxations ofthe power flow equations via Semi-Definite Programming (SDP). 
Therein, the idea is to lift the OPF problem to a higher dimensional space in 
which it becomes convex if a certain rank constraint is dropped [Bai+08; 
Mol+13]. This relaxed and inflated problem is then solved by a distributed or 
decentralized convex optimization algorithm obtaining convergence guarantees 
[DZG13b; PL17; Chr+17b]. The crux in this approach is that the solution of 
the relaxed problem might not satisfy the before-mentioned rank condition and 
thus the solution of the original OPF might not be recoverable from the relaxed 
problem. Under certain assumptions on the technical equipment such as small 
transformer resistances or a radial grid topology, e.g. radial grids [LL12; 
Low14b; Low14a] this “back-mapping” is guaranteed to exist, although these 
assumptions pose severe limitations on the class of problems to which this 
approach is applicable. However, there is an ongoing debate whether the 
assumptions made are realistic [Chr+17a] and, moreover, one can show that 
the recovering the solution from an SDP relaxation might fail even for very 
small systems [KDS16]. Furthermore, the number of decision variables when 
using SDPs is squared leading to very large SDPs (especially for large systems) 
which can lead to tractability issues [Cap16]. 


Recent non-convex distributed algorithms 


Recently, new algorithms designed for non-convex distributed nonlinear pro- 
gramming have been proposed which are able to handle the AC model directly 
and with guarantees without relying on relaxations or approximations. One 
of them is based on an alternating trust-region approach using a decentralized 
variant of conjugate gradients and alternating projection methods with conver- 
gence guarantees at linear rate for linearly-constrained non-convex problems 
[HJ17]. The approach was successfully applied to small OPF problems rang- 
ing from 9 to 56 buses. A second approach is based on ALADIN which we 
will present in this chapter [Eng-- 19b]. 


An overview of the previously outlined lines of research with main strengths 
and weaknesses of each line is given in Table 5.1. For more detailed overviews 
on general and distributed methods for OPF we refer to [Mol+17; Cap16]. 
Note that we consider ADMM and ALADIN to be in *this thesis" in Table 5.1 
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Table 5.2: Grid partitioning (excluding auxiliary buses). 


Test Case |A| Regions N; 


5 4 {1,5}, {2,3}, {4} 
118 13. (1-32, 113-115, 117}, {33-67}, (68-81, 116, 118}, {82-112} 


in the sense that we compare the performance of these two algorithms in this 
work. 


Remark 17 (Related OPF problems) Considering problems with integer vari- 
ables such as certain versions of the reactive power dispatch problem is beyond 
the scope of the present work. For first distributed approaches in this direction 
we refer to [Mur+18; Mur+19]. Moreover, note that although our approaches 
is also applicable to temporal decomposition, we focus on spatial decompo- 
sition here. Examples for temporal decomposition in OPF with storage or 
generator ramp constraints can be found for example in [KFS18; AC00; XSOI; 
CCD05]. Therein, similar to internal decomposition methods, the main goal 
is computational tractability. 


Note that these problem formulations have strong interconnections to the theory 
of economic model predictive control [Fau+18] originally developed for the 
optimal control of time-invariant dynamical systems. First attempts connecting 
these two worlds and transferring latest methodological developments from this 
field to power systems are proposed in [FGM18; FE19; Köh+17]. 


5.4 ADMM for Optimal Power Flow 


Now, let us apply distributed optimization algorithms from Section 2.2 to OPF. 
We start with ADMM for two reasons: first, as outlined in Section 5.3, ADMM 
is one of the most prominent methods for distributed optimization in general 
and particularly for distributed OPF [Ers14b; KB00; GHT17]. Secondly, we 
will use ADMM as a benchmark algorithm for evaluating the performance of 
ALADIN and the bi-level ALADIN scheme. 


98 


5.4 ADMM for Optimal Power Flow 


(b) Modified 5-bus system from [LB 10]. 


Figure 5.2: Numerical examples. 
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Throughout this chapter, we use a 5-bus system from [LB10] as a tutorial 
example and the IEEE 118-bus as a more realistic test system as numerical ex- 
amples. Both systems are shown in Figure 5.2. The corresponding partitioning 
are given in Table 5.2. In al cases, we tune ADMM for best performance. Note 
that the 5-bus system is particularly designed to be hard: we chose a partition- 
ing in which a large amount of power has to be exchanged between region one 
(as mainly generating region) and the other two regions (as mainly demanding 
regions). Moreover, we put line limits particularly on lines connecting regions 
which will lead to difficulties—especially for ADMM as we shall see later. 
Results for ALADIN and ADMM for further benchmark systems up to 300 
buses can be found in [Eng+19b]. 


5.4.1 Typical numerical results 


Next, we show typical numerical results when applying ADMM to OPF prob- 
lems. Figure 5.3 shows the convergence behavior of ADMM for the 5-bus 
example from Figure 5.2b for three different values of the tuning parameter 
p^PM e (10?, 10°, 104} neglecting line limits. The upper subfigures depict the 
values of all active power injections pê and reactive power injections qê over the 
iteration index k € N.! To characterize convergence, we consider three differ- 
ent metrics: the consensus violation || Ax* ||% at the current iterate x^, charac- 
terizing the maximum mismatch of active/reactive power or voltage at coupling 
nodes; the distance to the minimizer x*, ||x* —x*|[..;? the distance of the current 
objective function value f^ := f(x^) to a local minimum f* := f(x*), and 
r= Vi )+ ier yF'Vai(x*) + Mjen pF'N h;(xF) + AF (Zier Aixk - b) 
for (2.12) measuring the degree of stationarity (dual feasibility) in the KKT 
conditions (2.2a). Note that we use the infinity norm || - ||.) here for scale- 
independence. 


One can see that with proper tuning of p^PM, ADMM is able to reach a 
medium accuracy of 107? ... 107? within about 100 iterations. This is typical 
for ADMM and also observed in many other applications [Boy+11]. 


! We use the standard per-unit (p.u.) normalization system here, cf. [GS94] for details. 
? We compute the reference solution x* via the centralized interior-point solver IPOPT. 
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Figure 5.3: ADMM for the 5-bus system for p^PM e (102, 10°, 107} . 


Remark 18 (Multiple local minimizers) Note that as OPF is non-convex, there 
are possibly multiple local minimizers [Buk+13; LW15]. However, practical 
experience shows that in most cases there is only one technically meaningful 
high-voltage solution and local solvers converge to that solution if initialized 
properly [Cap16; Mom+97]. Here, all algorithms converged to the same local 
minimum for the standard initialization (every value zero except for the voltage 
magnitude which is initialized with 1, flat start). The theoretical foundations 
of existence and uniqueness of solutions to the power flow equations is subject 
to ongoing and future work [CS19; Sim18; MH19]. 


Scaling-up to 118 Buses 


Scaling up to large grids is often hard. Different numerical issues are more 
likely to occur with a higher dimension, for example linearly-dependent con- 
straint linearizations and finding an initial point close to a local minimizer 
might also be harder. Figure 5.4 shows the numerical performance of ADMM 
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Figure 5.4: ADMM for the IEEE 118-bus system with oAPM e {107, 10°, 10*}, and pt at nodes 
k € (10, 25, 26, 65}. 


for the IEEE 118-bus system from Figure 5.2a. One can see that ADMM needs 
about 150 iteration to converge to a modest accuracy of 107? in the distance 
to the local minimizer, which is round about twice as much compared to the 
5-bus system. 


5.4.2 The role of special constraints 


ADMM works quite well im many settings such as for the 5-bus or 118-bus OPF 
problem. However, next we show a practical example, where the convergence 
to the same level of accuracy is at least a factor of 10 larger compared with the 
before mentioned cases solely by adding line limits to the OPF problem. Note 
that this modification does neither change the problem class (non-convex NLP), 
nor the problem size. The line limits are indicated as red bars in Figure 5.2. 
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Figure 5.5: ADMM for the 5-bus system for p^PM e (10?, 10°, 10*} and active line limits. 


The numerical results are shown in Figure 5.5. Here, ADMM needs about 
1.500 iterations to reach the same level of accuracy of around 107?. 


5.4.3 The role of a feasible initialization 


Recent works applying ADMM to OPF problems suggest using increasing 
sequences for the penalty parameter p^PM [Ers15]. In [GHT17] this rule is 
combine with a feasible initial point for large test systems. Next we investigate 
the practical and theoretical implications of these modifications. 


Figure 5.6 shows the convergence behavior of ADMM for a very large value of 
p^PM = 10° with and without initialization at a feasible initialization. One can 
see that with a feasible initialization, ADMM stays at the initial point which is 
feasible, but the gap of the cost to the optimal one f^ — f* is large and does 
not decay to zero. This means that there is insufficient progress in terms of 
optimality. This can be explained by the role of the penalization parameter in 
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Figure 5.6: ADMM for the 5-bus system and p^PM = 10°, active line limits initialized with 
(yellow) and without (red) a feasible initial point. 


ADMM (Algorithm 4): in case p^PM is large, the constraints g; and h; are 
satisfied and as A; has full row-rank, ALADIN will return an = z% in step 1) 


forced by the quadratic penalization term £ - is lA; (xi — z5) R One can think 
of the following two steps 2) and 3) as a kind of averaging step of the coupling 
variables, which will also not change if the variables a do not change. Hence, 


ADMM sets stuck at a feasible point in this case. 


Mathematical analysis 
Next, we analyze this effect also mathematically for ADMM in case of b = 0, 


which we have for OPF (cf. Appendix B). Note that in the analysis we use o 
instead of p^PM for simplified notation. 


Proposition 1 (Feasibility and p — œ implies aU —xk € null(A;)) Consider 
the application of ADMM (Algorithm 4) to problem (2.12) for b = 0. Suppose 
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that, for all k € N, the local problems in step 1) have unique minimizers a 
fulfilling the LICQ. For k € N, let A* be bounded and z% € User X; NC. 


Then, the ADMM iterates satisfy p" = xk 


Proof. The proof is divided into four steps. Steps 1)-3) establish technical 
properties used to derive the above assertion in Step 4). 
Step 1). At iteration k, the local steps of ADMM are 
k l du p P 
xf(p) = argmin fi) + (E) Amie alu (u-)L. e» 
Xi EX; 2 2 


Now, by assumption, all f; are twice continuously differentiable (hence 
bounded on Xj), ae is bounded and all ze € Xi. Thus, for all i € R, 
lim xF (p) = zk + vk with v* € null(A;). 

pow 


Step 2).The first-order stationarity condition of (5.2) can be written as 


-V fix) - y vh;(xF) = ATE + pA] A; E. = d (5.3) 


i 


where ykT is the multiplier associated to h;. Multiplying the multiplier update 
formula from step 3) with A] from the left we obtain AJ AF! = AT AK + 


pA] Ai(x* — zk). Combined with (5.3) this yields ATA! = -V f(x) - 
yFTV h(xF). By differentiability of f; and h; and regularity of a this implies 
boundedness of A7 AF*!. 

Step 3). Next, we show by contradiction that Ah € null(A;) for all i € Rand 
p oo. By expressing z; as z; = xe + Ax; in step 2) of ADMM, the update 


reads 


min Y Pax} A] Aix; - APT Aj; st. M Ai + Ax) =0. (64) 
is ief 2 IER 


Observe that any Axk € null(A;) is a feasible point to (5.4) as Mier Aixk =0 
by step 1) of this proof and by feasibility of a, Consider a feasible can- 
didate solution Ax; € null(A;) for which Y;cr A; (xk + Ax;) = 0. Clearly, 
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AF? Aj Ax;(p) will be bounded. Hence for a sufficiently large value of p, 
the objective of (5.4) will be positive. However, for any Ax; € null(A;) the 
objective of (5.4) is zero, which contradicts optimality of the candidate so- 
lution Ax; ¢ null(A;). Hence, choosing p sufficiently large ensures that any 
minimizer of (5.4) lies in null( A;). 


Step 4). It remains to show gp - x Recall that we used z**! = x* + Ax* in 


k 


the previous step. Given Steps 1-3) this yields ght = 2k + yk + Ax* and hence 


- o2 "EN AJ Ajj 
le (e=) = [Ai meu [Ai -h 


Observe that this implies that, for p — oo, problem (5.2) does not change from 
step k to k + 1 by step 3) of Algorithm 4 as A**! = A* since by step 3) of this 
proof also x**! — ee lies in the nullspace of A;. This proves the assertion. o 


i 


The above proposition shows that the problems in step 1) of Algorithm 4 and 
also ak do not change for subsequent iterates, once ADMM is at a feasible 
point together with p — co. With the assumption that the local solvers are 
deterministic, i.e. they yield the same solution for same problem data this 
implies that ADMM stays at this feasible point, cf. Corollary | in [EF18]. 


Termination by consensus violation and alternating projections 


A second important insight is that a small consensus violation || Ax* || does not 
necessarily imply that the algorithm is close to a local minimizer. Otherwise, 
ADMM would terminate immediately in case of a feasible initial guess and a 
high penalization parameter. It tells something of the feasibility of the current 
step, but the objective function value f* might be far from being optimal. 
This is especially relevant here as ||Ax^|| is sometimes used as termination 
criterion for ADMM for OPF in the literature [Ers15; Ers14a]. In Appendix C 
we provide two additional analytical examples illustrating the above behavior. 


In the opposite case of an infeasible initial guess in combination with a very 
high penalization, ADMM starts projecting the iterates x* and z* back and 
forth between the consensus constraint set C and the local nonlinear constraints 
(power flow equations and bounds) X;, cf. Algorithm 4. This can for example 
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Table 5.3: ALADIN parameters 


y? ry Y p? Tp p 


5-bus 10? 1.1 108 10^ 2 2.109 
118-bus 10? 1.1 108 10? 2 2.106 


be seen by the large values of ||Ax*|| in Figure 5.6. This is similar to the 
examples for alternating projections from Section 2.3.1. 


5.5 ALADIN for Optimal Power Flow 


Next, we investigate the convergence behavior of ALADIN (Algorithm 5) and 
bi-level ALADIN (Algorithm 8) for OPF. In all cases, we initialize ALADIN 
with a flat start”, we use the regularization technique from [Eng20a] and 
ALADIN parameters given in Table 5.3. Moreover, we use diagonal scaling 
matrices X with entries of 10? for voltage angles and magnitudes and entries 1 
for active and reactive power injection. 


Figure 5.7 shows the convergence of ALADIN for the 5-bus system with and 
without line limits, and for the IEEE 118-bus system. One can see that for all 
cases, ALADIN converges quite fast and almost independently of the problem 
size and the considered constraints (with/without line limits) in less than 20 it- 
erations to a very high accuracy in all indicators. Observe that here we achieve 
accuracies of about 107? whereas for ADMM we achieved 107? . . . 107?. This 
indicates that ALADIN is less scale-dependent and problem dependent com- 
pared with ADMM —likely a consequence of the relationship to SQP methods 
and because of constraint information in the coordination step. Moreover, 
note that in all cases we use a flat-start initialization—hence ALADIN does 
not rely on any sort of feasible initial guess. The large improvements in the 
accuracy of ALADIN compared with ADMM can especially be observed in 
the convergence of reactive power injections q8: whereas after 15 iterations 
they are more or less converged for ALADIN in the 5-bus system (Figure 5.7 


? In the OPF context, a flat start is the initial guess where the voltage magnitudes are initialized 
with 1 p.u. and all other values with 0. 
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Figure 5.7: ALADIN for the 5-bus system without line limits (yellow), with line limits (red) and 
for the 118-bus system (blue) with flat start. 


red lines), for ADMM even after 1.500 itertions there seem to be significant 
changes regardless of the parameter p^PM, cf. Figure 5.5. 


5.5.1 Bi-level distributed ALADIN 


Next, we investigate convergence of bi-level ALADIN for the 118-bus OPF 
case. For b-CG we use a fixed number of 70 inner iterations and for b- 
ADMM we use 60, 100 and 200 inner iterations. Both variants require a 
minimum number of inner iterations (70 for b-CG and 60 for b-ADMM) to 
converge—i.e. to fulfill the accuracy requirement from inequality (4.33). We 
use T ud = 107? as the optimal penalization parameter for inner ADMM. For 
both inner algorithms, we use warm-starting. Pure ADMM uses p^PM = 104 
which was the optimal value achieving best overall performance. 
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Figure 5.8: ALADIN (yellow), bi-level ALADIN with CG (dashed blue), bi-level ALADIN with 
ADMM (dashed orange) and ADMM (purple) for the 118-bus system. 


Figure 5.8 shows the convergence behavior of basic ALADIN, b-ADMM, b- 
CG and pure ADMM. The graphs for ADMM and for basic ALADIN are the 
same as before but displayed for the sake of comparison. One can see that b-CG 
reaches almost the same overall performance compared with basic ALADIN. 
This is interesting as CG also provides a limited accuracy only in the inner 
step. 


For b-ADMM on the other hand, this is different: the achievable accuracy 
of the whole algorithm seems to depend on the number of inner iterations in 
ADMM. This can be seen that ||Ax*||.; and ||x* — x*||.. stop decaying after 
around 10 outer iterations in Figure 5.8. Another difficulty when using ADMM 
as inner algorithm is tuning. We will investigate these two effects next. 
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Figure 5.9: Accuracy of inner ADMM after 100 iterations for different values of piin and after 


k € (1,4, 16} outer iterations of bi-level ALADIN for 118-bus OPF. 


5.5.2 Tuning inner ADMM 


ADMM as an inner algorithm requires a serious amount of tuning. Figure 5.9 
shows the accuracy in the inner problem (4.9), || $^ 4" —$* ||», over the parameter 
fo for a fixed amount of 100 inner d-ADMM iterations after k € {1,4, 16} 
outer bi-level ALADIN iterations. One can see, that for the pair (S i 5), there 
are two different optimal values for pM located at two different regions at 
around 107? and at around 10° after one outer iteration. This is one reason 
making tuning difficult. 


Moreover, the optimal pM changes significantly while ALADIN proceeds: 


after four outer ALADIN iterations, i.e. for (S^. 34), the optimal value is 
located at about 10° and after 16 iterations it is located at around 10? to 10°. 
This is a second reason making tuning of inner ADMM algorithm extremely 
difficult since this requires that ADMM has to be re-tuned after each outer 
ALADIN iteration. 


A similar conclusion can be drawn from Figure 5.10. This figure shows the 
residual of the coordination step (4.9), ||S*a” — 5*||«, for d-ADMM (Algo- 
rithm 6) and for d-CG (Algorithm 7) in the first ALADIN iteration (left) and 
after 16 ALADIN iterations (right). Also from this figure one can conclude 
that the optimal parameter for inner ADMM changes by four orders of magni- 
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Figure 5.10: Convergence of d-CG and d-ADMM for the 118-bus OPF problem different values 
of pap after one outer ALADIN iteration (left) and after 16 outer iterations (right). 


tude from about 107? in the beginning to 10°. Moreover one can see another 
interesting effect which is a possible explanation for the limited accuracy from 
Figure 5.10: whereas d-ADMM converges to a level of a bout 107? in the 
beginning of the ALADIN iterates, in later iterations the achievable accuracy 
seems to be limited to 107? Note that this can not be influenced by tuning of 
Du We tried different tuning parameters and pee = 10? showed the best 
convergence and all other tuning parameters led to worse results. This effect 
seems interesting not to be investigated in the literature so far. In Appendix C 
we elaborate on this effect and we present an example, where this effect occurs 
already for very small problems. 


For d-CG, in contrast, this effect does not occur and d-CG converges to high 
accuracies in all cases. Moreover, d-CG is parameter-free and does thus not 
require any tuning. 


Remark 19 (Preconditioning for d-ADMM/d-CG) By preconditioning of (4.9), 
one can significantly accelerate convergence of d-ADMM and d-CG [GB17; 
Ste+20; Saa03]. However, note that preconditioning is also a centralized opera- 
tion in general, where it has to be investigated whether the additional communi- 
cation/coordination effort is outweighed by less iterations in dc ADMM/d-CG. 
In applications, where, the coefficient matrix $^ does not change significantly 
from one step to another, offline preconditioning seems promising (e.g. in 
a control setting [Sta+16]). However, as we have shown previously, the co- 
efficient matrix in bi-level ALADIN changes significantly and thus offline 
preconditioning seems less promising—in view of changes in the active set but 
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Table 5.4: Total forward communication until convergence (floats) for 118-bus OPF. 


lobal local lobal local 
ALADIN variant 59)? 00A MN RE 


ó = 107 ó -107* 
basic 852,992 O0 1,066,240 0 
condensed 13,152 0 16,440 0 
bi-level (CG) 1,120 11,200 1,400 14,000 
bi-level (ADMM-100) 0 24,000 - - 
bi-level (ADMM-200) 0 36,000 - - 
pure ADMM 0 2,840 0 6,320 


also in view of the fact that curvature information changes in each ALADIN 
outer iteration. 


5.6 Comparing communication 


Next, we compare the forward communication from Section 4.6 for all algo- 
rithms numerically. Recall that in Section 4.6 we analyzed the communication 
effort per iteration. Here we analyze the total amount of communication, i.e. 
the number of floats which have to be communicated in sum until a pre-defined 
accuracy in the decision variables is reached. So here the “speed” of the algo- 
rithm plays a significant role since the amount of per-step communication is 
multiplied by the total number of iterations. We compare the algorithms for a 
given accuracy in the primal variables 6 = ||x* — x* |o. 


Table 5.4 and Table 5.5 show the total amount of forward communication 
for all investigated distributed optimization algorithms for the 118-bus OPF 
(Table 5.4) example and the 5-bus OPF example with line limits (Table 5.5). 
Hereby distinguish two different values for the accuracy ó for both grids: a 
low accuracy case and a high accuracy case. As predicted in Section 4.6, 
the condensed ALADIN variant significantly reduces the amount of required 
communication. For the 118-bus system, the reduction is quite large with 
a factor of round about 60 and for the 5-bus system with a factor of 5 a bit 
smaller. Note that the reduction here mainly depends on the number of coupling 
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Table 5.5: Total forward communication until convergence (floats) for 5-bus OPF with line limits. 


ALADIN variant global local global local 


ó - 107! ó 2107 
basic 13,368 0 14,482 0 
condensed 2,688 0 2,912 0 


bi-level (CG) 1,820 10,920 2,240 13,440 
bi-level (ADMM- 100) - = - = 
bi-level (ADMM-200) - - - s 


pure ADMM 0 11,580 - - 


variables n, in relation to the number of decision variables nx. For the 5-bus 
system this ration is quite large making the reduction in communication smaller. 


Moreover, one can observe that bi-level ALADIN with CG requires round 
about the same communication for the 118-bus system as the reduced space 
variant—but now this is mainly local communication. For the 5-bus system, 
b-CG requires slightly more communication than condensed ALADIN. b- 
ADMM needs an about a factor 2-3 larger amount of local communication 
mainly due to the slower convergence in the inner problem compared with 
b-CG. 


Pure ADMM requires less communication compared with all other variants in 
the low-accuracy case (6 = 107?) for the 118-bus system. However, as ADMM 
typically becomes quite slow in later iterations, the gap becomes smaller with 
higher accuracy requirements. For an accuracy of ö = 107^ for example, the 
reduction factor of ADMM is only 2 instead of 5 for 6 = 107°. 


For the 5-bus system with line limits (Table 5.5) this effect is extreme: as 
ADMM becomes very slow here (cf. Figure 5.5), ADMM needs more com- 
munication than b-CG. Moreover, the achievable accuracy of pure ADMM 
within 2,000 iterations is no more than 10-2. Moreover, b-ADMM did not 
converge at all for this case since the achiveable accuracy in the inner problem 
was not high enough to achieve an accuracy of 107!. 
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5.7 Summary and conclusion 


The previous sections showed that ADMM can be applied successfully to 
OPF problems and converges quite robustly in many cases. However, ADMM 
has in general no convergence guarantee and divergence seems to occur in 
practice [Chr-17a]. Moreover, often convergence is possible only up to a 
limited accuracy and ADMM might require a large number of iterations to 
achieve even that accuracy. Although this level of accuracy is often sufficient 
in an OPF context as many parameters such as the power demands are anyways 
known only up to a limited certainty [Cap16], one has to be aware of the fact that 
a consensus violation ||Ax|| # 0 always means that the solution is infeasible 
to some degree. This implies that active/reactive power at interconnection 
points do not match perfectly and when applying ADMM to real systems and 
one has to account for that with and underlying controllers compensating this 
mismatch. Furthermore, we showed that the performance of ADMM can vary 
greatly depending on the type of constraints considered (with or without line 
limits for example). 


We also showed that a feasible initialization in combination with high pe- 
nalization parameters leads to feasible but not necessarily optimal solutions 
for ADMM. Moreover, we concluded that that consensus violation (|| Ax* ||) 
alone, as sometimes used in the literature [GHT17; Ers15], is in general insuf- 
ficient for termination. Especially in combination with the above combination 
of a feasible initialization with a feasible initial point this can lead to an ar- 
bitrary fast termination of ADMM. In this case, one can essentially enforce 
convergence of ADMM in one step in this case by choosing a sufficiently large 
p although the current iterate is far from a local minimizer. Moreover, the as- 
sumption of a feasible initial guess seems strong. Finding a feasible point has 
about the same complexity as the full OPF problem itself and again requires a 
strong central coordinator jeopardizing the goals of distributed optimization. 


Applying ALADIN variants 


As an alternative to ADMM, we proposed different ALADIN variants. AL- 
ADIN has the appealing property of a theoretically very fast local convergence, 
which we also observed in practice. One drawback of ALADIN are its compa- 
rably large communication and coordination requirements. As an alternative, 
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Table 5.6: ADMM, ALADIN and bi-level ALADIN for AC OPF. 


guarantees speed communication coordination robustness accuracy 


ADMM -- o + ++ + 
ALADIN ++ ++ -- - o ++ 
bi-level ALADIN ++ ++ + + o ++ 


we proposed a condensed variant of ALADIN, substantially reducing coordi- 
nation and communication requirements. Moreover, with bi-level ALADIN, 
decentralized is also possible, while preserving ALADIN’s fast local conver- 
gence properties. We showed that the performance of bi-level ALADIN with 
conjugate gradients (b-CG) is almost indistinguishable from basic ALADIN’s 
performance although conjugate gradients introduce inexactness in the co- 
ordination step. For b-ADMM this is different: the achievable accuracy of 
b-ADMM seems to depend on the number of inner ADMM iterations. This 
comes from the fact that ADMM is able to solve the coordination problem up 
to a certain accuracy only and this effect can not be overcome by tuning. We 
also showed that this effect occurs already for very small-scale problems (cf. 
Appendix C). Moreover, tuning of inner ADMM can be difficult. However, 
b-ADMM offers full decentralization in the sense that it does not require any 
form of global scalar sum such as b-CG. The convergence rate of all the before 
mentioned ALADIN variants seems to be independent from the problem size 
and also from the considered constraints, which was not the case for ADMM. 
For large problems tuning of all ALADIN variants and also ADMM becomes 
increasingly difficult. 


The properties of ADMM, ALADIN, and bi-level ALADIN are summarized 
in Table 5.6. 


Choosing an algorithm 


Choosing an appropriate algorithm is difficult. Whereas ADMM works well 
in many cases, convergence might be very slow and thus ADMM might take 
a long time to converge. The ALADIN variants usually converge very fast 
and decentralization is possible via bi-level ALADIN although tuning might 
become difficult for larger grids. If a high accuracy is required, ALADIN is 
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certainly first-choice. Moreover, ALADIN is guaranteed to converge for an 
initial point close enough to a local minimizer, which ADMM is not. In terms 
of total communication, ADMM has often a slightly lower footprint. 
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In this chapter, we present an open-source MATLAB toolbox, ALADIN- 
a, implementing ALADIN, bi-level ALADIN and ADMM with a unified 
interface. 


Major parts of this chapter are from [Eng+20]. We limit ourself to a brief 
overview here, for a more detailed description of options and subroutines we 
refer to [Eng+20] and to the documentation website [Eng20b]. 


6.1 Features of ALADIN-a 


ALADIN-a is intended for rapid prototyping of distributed and decentralized 
optimization algorithms and aims at user-friendliness. The only user-provided 
information are objective and constraint functions— derivatives and numerical 
solvers are generated automatically by algorithmic differentiation routines and 
external state-of-the-art NLP solvers. A rich set of examples coming with 
ALADIN-a covers problems from robotics, power systems, sensor networks 
and chemical engineering underpinning its application potential. 


ALADIN-a supports 


* bi-level ALADIN and the condensed variant of ALADIN from Chap- 
ter 4; 


* a BFGS Hessian extension from [Eng+19b]; 
* parametric NLPs enabling distributed Model Predictive Control (MPC); 


* heuristics for regularization and parameter tuning. 


The ADMM variant from Algorithm 4 is also included. ALADIN-« can 
be executed in parallel mode via the MATLAB parallel computing toolbox. 
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This can lead to a substantial speed-up, for example, in distributed estimation 
problems. However, although distributed optimization can be used for parallel 
computing, computation speed or real-time feasibility is currently not a primary 
focus of the toolbox. A documentation and many application examples of 
ALADIN-a are available under [Eng20b]. 


6.2 Literature review 


Despite the various practical applications of distributed optimization, only 
very few software toolboxes are freely available. Even if one can find one of 
the rare examples, these tools are typically tailored to a specific application at 
hand and typically address convex problems. Examples are several versions of 
ADMM applied to a plethora of applications summarized in [Boy+11], with 
code available under [Boy+20]. An implementation of ADMM for consen- 
sus problems can be found under [Mot+20]. A tailored implementation of 
ADMM for OPF problems using an algorithm from [GHT17] can be found 
under [Guo20]. However, there is a lack of multi-purpose software tools for 
distributed optimization and we were not able to find any open-source imple- 
mentations for generic distributed non-convex optimization problems. Also 
with respect to decentralized non-convex optimization, we were not able to 
find any publicly available code. 


However, for parallel optimization efficient structure-exploiting tools exist. A 
closed-source parallel interior point software is OOPS [GGO7]. The open- 
source package qpDunes is tailored towards parallel solutions of QPs arising 
in model predictive control [FSD15]. For general QPs, the partially paralleliz- 
able solver OSQP seems promising [Ste+20]. PIPS is a collection of algorithms 
solving structured linear programs, QPs, and general NLPs in parallel [CPZ14; 
Lub+11]. The software HiOp is tailored towards structured and very large-scale 
NLPs with few nonlinear constraints based on interior point methods [Pet19; 
PCA19]. Moreover, combining parallel linear algebra routines (e.g. PARDISO 
[Sch+01]) with standard nonlinear programming solvers (e.g. IPOPT [WB06]) 
also leads to partially parallel algorithms [Cur+12; KFS18]. All these tools 
are implemented in low-level languages such as C or C++ leading to a high 
computational performance. However, their focus is mainly on computational 
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speedup via parallel computing rather than distributed and decentralized opti- 
mization in a multi-agent setting. 


6.3 Parametric problem setup 


ALADIN-a solves structured optimization problems of the form 


min >)! fi(xi, po) (6.12) 
Teens GER 
subject to gi(xi, pi) = 0 | Kis VER, (6.1b) 
hi(xi, pi) < 0 | Yi Vie f, (6.1c) 
Xi E xi € Xj | Nis Vi € R, (6.1d) 
2 Aix; = b IA, (6.1e) 
TER 


which is a slight modification of (2.12) introducing box constraints separately. 
We do so for numerical efficiency reasons—some local solvers are able to 
handle box constraints by specialized routines. Moreover, problem (6.1) allows 
for parameter vectors p; € R’'ri. This can be useful for example in Model 
Predictive Control or if one would like to solve the same distributed problem 
multiple times for a variety of parameters. 


6.4 Software structure 


6.4.1 Code structure 


In order to avoid side-effects and to make code-modification easy for beginners, 
we choose a procedual/functional programming style. We decided to imple- 
ment all core features in MATLAB to enable easy rapid-prototyping. The over- 
all structure of run. ALADIN () —which is the main function of ALADIN-a—is 
shown in Figure 6.1. First, a preprocessing step performs a consistency check 
of the input data and provides default options. The createLocSolAndSens () 
function initializes the local parameterized NLPs and sensitivities for all sub- 
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a) preprocessing checkInput(), setDefaultOpts() 
b) problem/sensitivity setup createLocSolAndSens() 
ALADIN main loop iterateAL() 
©% c)solve local NLPs parallelStep(), BFGS(), 
E d) evaluate sensitivities parallelStepInnerLoop(), 
[of 
‚3 e) Hessian approx. /regularization updateParam(), regularizeH() 
f) solve the coordination QP createCoordQP(), solveQP() ,solveQPdec() 
g) compute primal/dual step computeALstep() 
h) postprocessing displaySummary(), displayTimers() 


Figure 6.1: Structure of run_ALADINC) in ALADIN-a. 


problems i € R. For constructing the local NLPs and sensitivities, we use 
CasADi due to its algorithmic differentiation features and the possibility to 
interface many state-of-the-art NLP solvers such as IPOPT [WB06; And+19]. 
CasADi itself relies on pre-compiled code making function and derivative 
evaluation fast. A reuse option allows reusing the CasADi problem setup. 
When the reuse mode is activated (e.g. when ALADIN-a is used within an 
MPC loop), createLocSolAndSens() is skipped resulting in a significant 
speed-up for larger problems. 


In the ALADIN main loop iterateALO, the function parallelStepQ 
solves the local NLPs and evaluates the Hessian of the Lagrangian (or it's 
approximation e.g. when BFGS is used), the gradient of the objective, and the 
Jacobian of the active constraints (sensitivities) at the NLP's solution. Regular- 
ization is done in parallelStep() if needed. Moreover, in case the nullspace 
method or bi-level ALADIN is used, the computation of the nullspcaces and 
the Schur-complements is done locally shifting substantial computational bur- 
den from the centralized coordination step to parallelStep(). The function 
updateParam() computes dynamically changing ALADIN parameters for 
numerical stability and speedup. 


The coordination QP is constructed in the function createCoordQP(). We 
use problem (2.25) including slack variables for numerical stability. Differ- 
ent dense and sparse solvers for solving the coordination QP are available 
in solveQP(). Most of them are based on solving the first-order necessary 
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sProb 7 locFuns y ffi res = run ALADIN(sProb, opts) —» "°S xxOpt 
llbx I ggi 4 4 lamOpt 
uubx hhi iter 
AA Gips ili eom timers 


Figure 6.2: The sProb data structure for defining problems in form of (6.1). 


conditions which is a linear system of equations. Available solvers are the 
MATLAB linear-algebra routines linsolve(), pinv(), the backslash op- 
erator and MA57 based on the MATLAB LDL-decomposition routine. Using 
sparse solvers can speed up the computation time substantially. Note that only 
MA57 and the backslash-operator support sparse matrices. The solver can be 
specified by setting the solveQP option. In case of convergence problems 
from remote starting points, it can help to reduce the primal-dual stepsize of 
the QP step by setting the stepSize in the options to a values smaller than 1. 


6.4.2 Data structures 


The main data structure for defining problems in form of (6.1) is a struct called 
sProb (cf. Figure 6.2). In this data structure, the objective functions { fi}icr 
and constraint functions {g;}ier and {h;}ier are collected in cells which are 
contained in a struct called locFuns. Furthermore, sProb collects lower/upper 
bounds (6.1d) in cells called 11bx and uubx. The coupling matrices (Aj) ;e« 
are collected in AA. Optionally, one can provide NLP solvers and sensitivities— 
in this case the problem setup in createLocSolAndSens () is skipped leading 
to a substantial speedup in runtime for larger problems. Optionally one specify 
initial guesses in zz and initial Lagrange multipliers 1am8. The second 
ingredient for ALADIN-a is an opts struct. There, one can specify the 
variant of ALADIN-a and algorithmic parameters. A full list of options with 
descriptions can be found under [Eng20b]. 


ALADIN-a returns a struct as output. This cell contains a cell of locally 
primal optimal solutions xxOpt with (x*);eg. lamOpt are the optimal la- 
grange multipliers for the consensus constraints (6.1e), A*. Moreover the field 
iter contains information about the ALADIN iterates such as primal/dual 
iterates and timers contains timing information. Note that run. ALADIN() 
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and run_ADMMC) have the same function signature in terms of sProb—only 
the options differ. 


6.5 Atutorial example 


Next, we provide two numerical examples illustrating how to use ALADIN-« 
in practice. The first one is a minimalistic example showing how to formulate 
problems in form of (6.1) and solving this problem with ALADIN-a. 


First, we investigate how to reformulate a tutorial optimization problem in 
partially separable form 6.1. Let us consider the non-convex NLP 


min f(x) =2 (x1 - 1)? + (x; - 2 (6.22) 
X],X2€R 
subjectto — 1 <xı:x2 < 1.5. (6.2b) 


In order to apply ALADIN-a, problem (6.2) needs to be in form of (6.1). To 
get there, let us introduce auxiliary variables yj, yo with y; € Rand y? = 
(y21 yo2)'. Let us couple these variables again by introducing a consensus 
constraint 5; A;y; = 0 with A; = land A» = (-1 0). Furthermore, let 
us reformulate the objective function f by local objective functions fi(y1) := 
2 (y1 1)? and f(y2) = (yz; -2)? with f = fi + fo. Moreover, we reformulate 
the global inequality constraint (6.2b) by a local two dimensional constraint 
function ha = (haı hy)" with h»i(y2) --1-yn y22 and h22(y2) =-1.5+ 
y21 Y22. Combining these reformulations yields 


min 2(y;- 1)? + (y22 - 2? (6.3a) 

yi €R,y2€R? 
subject to - 1- y2 y22 < 0, -15+ y2ıy22 < 0, (6.3b) 
yı + (-1 0)y2 20, (6.3c) 


which is in form of problem (6.1). Note that the solutions to (6.2) and (6.3) 
coincide but (6.3) is of higher dimension, thus one can view the reformulation 
as a kind of lifting to a space of higher dimensionality. Moreover, observe 
that this reformulation contains a general strategy for reformulating problems 
in form of (6.1): if there is nonlinear coupling in the objective functions or 
the constraints, introducing auxiliary variables and enforcing them to coincide 
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$ define symbolic variables 


VE 
y2 = 


Symi yie v ps SEM rear; 
sym y2"; [21] real"); 


% define symbolic objectives 
fls = 2a (Iean; 
f2s = Wale) 


% define symbolic neq. Constraints 


-i-y2(1)*y2(2); 
)*y2(2)]; 


$ convert symbolic variables to 
% 


9. 


$ 


2 


$ 


ill 
f2 


9. 


$ 


h 
h2 


IL. d 
Mop 
Il 


define symbolic variables 
1); 
2); 


define symbolic objectives 
2 s (al => dyes 
(y_2(2) - 2)^2; 


Ss = 
s = 


define symbolic ineg. 
constraints 

U; 

s.p c yall)ey2(2), 
45 + yt) ey (2) 1; 


convert symbolic variables to 


MATLAB functions $ MATLAB functions 
fl = matlabFunction(fls,’Vars’,{y1}); EL = Function ( EL zen, TLS 
f2 = matlabFunction(f2s,’Vars’, {y2}); f2 = Function 627, {yoo}, Rs 
hl = Gov iol = funotriont iil, ig by, hls 
hasnatlabkuneerenilh2sy Vars dyz) | h2 = Fünction( h2"; {y2 1h23 
% define objectives 
fl = qeu 2 * dyl = Rz 
as c fe» (Wea) = 2) 95 


$ define inequality constraints 


jell = (eb qu] 
JE c de J| d — xq 
-1.5 + y2(1) * y2(2)]; 


v WENDY Pees} 


Figure 6.3: Tutorial example with three different ways of problem setup. 


Pa 
3008 


; 
P); 


by an additional consensus constraint in form of (6.le) yields purely affine 
coupling. With that strategy, one can reformulate most nonlinear program in 


form of (6.1e). 


Solution with ALADIN-a 
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Figure 6.4: Collection of variables and output of ALADIN-a. 
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ALADIN-alpha v0.1 


eo 
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es local step size 
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Figure 6.5: ALADIN- a iteration plot for tutorial problem (6.3). 


To solve (6.3) with ALADIN-a, we set up our problem formulation in a struct 
sProb as described in Section 6.4.2. To illustrate different possibilities of 
problem setup for ALADIN-o, we construct the objective and constraints 
functions in three different ways: a), via the MATLAB symbolic toolbox, b), 
via the CasADi symbolic framework and, c), directly via function handles. AII 
these ways are shown in Figure 6.3. 


After defining objective and constraint functions, all function handles and the 
coupling matrices A; are collected in the struct sProb (Figure 6.4). We call the 
run. ALADIN () function with an empty options struct leading to computation 
with default parameters. These steps and the resulting ALADIN-a report 
after running run ALADIN() is shown on the right pane of Figure 6.4. In 
the ALADIN-a report, the reason for termination and timing of the main 
ALADIN-a steps is displayed. Note that plotting takes a substantial amount 
of time—so it advisable to deactivate online plotting if it is not needed for 
diagnostic reasons. Figure 6.5 shows the plotted figures while ALADIN-a is 
running. The figures show (in this order) the consensus violation ||Ax — b|l.o, 
the local step sizes ||x* — z*||,., the step size in the coordination step ||Ax* ||. 
and the changes in the active set. From these figures one usually can recognize 
divergence quite fast and also can get a feeling on the effectiveness e.g. for new 
internal heuristics or the degree of accuracy reached after a certain number of 
iterations. 


6.6 Applications beyond optimal power flow 


ALADIN-a comes with a rich set of examples. These examples include 
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example field directory online docs 
examples/ alexe15.github.io/ALADIN.m/ 
mobile robots robotics/control robots robotEx 


optimal power flow power systems | optimal power. flow redComm 
sensor network estimation sensor network ParallelExample 
chemical reactors chem./control chemical reactor 


Table 6.1: Application examples of ALADIN- a. 


* an example for distributed optimal control of a chemical reactor, which 
we will present in the next section; 


* an example for distributed optimal control of mobile robots [Meh+17; 
Eng+20]; 


* an example for distributed estimation in mobile sensor networks from 
Houska et al. [HFD16]; 


e a logistic regression example from machine learning. 


Furthermore, we included several test problems form the Hock-Schittkowski 
test collection [HS80]. The code for all these examples is available in the 
examples\ folder of ALADIN-a. Furthermore, we provide textual descrip- 
tions of these examples in the documentation of ALADIN-a online [Eng20b]. 
The application examples are summarized in Table 6.1. 


6.7 Distributed control of a chemical reactor 


Next, we show how to use ALADIN-a for distributed optimal control of a 
chemical reactor. This OCP can serve as a basis for distributed model predictive 
control [RMD17; SWR11; MA17]. The chemical process we consider here 
consists of two CSTRs and a flash separator as shown in Figure 6.6 [Cai+14; 
CLM11]. The goal is to steer the reactor to the optimal setpoint 


ul =(0 00) 
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Figure 6.6: Reactor-separator process. 


and 


X, = (369.53 331 0.17 0.04 435.25 2.75 
0.45 0.11 435.25 2.88 0.50 0.12) 


from an initial point 


x(0)" = (360.69 3.19 0.15 0.03 430.91 2.76 
0.34 0.08 430.42 2.79 0.38 0.08). 


After applying a fourth-order Runge-Kutta scheme for discretization, the dy- 
namics of the CSTRs and the flash separator are given by 

P d = au, for ie f, 
where q; : R'» x R'« x R'z — R” are the dynamics of the ith vessel and 
where R := {1, 2, 3} is the set of vessels. Here, x = (XAi, XBi, XCi, T;) are the 
states with xa;,xg; (10? mol/m?) being the concentrations of the reactants, A, 
B and C and T (K) is the temperature. The inputs u; = Q; (10? J/h) denote the 
heat-influxes of the individual vessel and z; are copied states of the neighbored 
reactors influencing reactor i, i.e. z; := (xj)jew(i;). Note that the feed-stream 
flow rates Fio, F20, F3, Fr and F are fixed and given. Detailed equations for 
the dynamics of the CSTRs/separator are given in [CLM11]. 
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The optimal control problem 


With the above, we are ready to formulate a discrete-time optimal control 
problem 


1 1 
2 ze = xis)" Q; Ge — Xis) + 5 (ui = egg) R; (uk — uis) 


1 
(XE. zF,uF) 


i kel 
Vkelj, T) TER €in T] 
VIER 
s.t. x - qi(x¥, u% $3 25) 20 0, x? =x;(0) Vk €lp T], VIER, 


uXubszü;, x,<x, Vkelp rp Vie R, (6.4) 


22A ur) -0 Vkely n. 
ief 


with lower/upper bounds on the inputs u--uz-(5.10^ 1.5-10° 2.10)" 
and lower bounds on the states ae = 0 for all times k € Ij; rj and all vessels 
i € R. The weighting matrices are chosen to Q; = diag(20 10° 10? 10°) 
and R; = 10-!%. The matrices A; are selected such that they represent the 
constraint z; := (xj)jew(;. The sampling time is Ah = 0.017 and the 
horizon is T = 10h. By defining X] := (xF' zF u$”) ken ai fii) = 


L L 


È kely T] 5 (xk u Xis)! Q; (xk — Xis) + Likely T-1] 5(uk = Wig) R; (uk uS ur. 

ze (yk k „k ok =) ._ k jk = 
8i (i) := GEI - qi(x; 5 U; sZ )) kely TJ and h;(3;) := ((u; u Uy —Ui 
x= XP een ,, one can see that the OCP (6.4) is in form of (6.1) and thus 
solvable by ALADIN-a (X; here corresponds to x; in (6.1)). 


Numerical results 


Figure 6.7 shows the resulting input and state trajectories for one OCP (6.4) 
for basic ALADIN and ADMM after 20 iterations, and for ADMM after 
100 iterations. At first-glance all trajectories are quite close to each other. 
However, small differences in the input trajectories can be observed. Figure 6.9 
shows the convergence indicators from Chapter 5 over the iteration index k. 
In logarithmic scale, these differences can be quite large. Fore example the 
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Figure 6.8: Optimal state trajectories computed by ALADIN & ADMM. 
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Figure 6.9: Convergence of ALADIN and ADMM for OCP (6.4). 


consensus gap || Ax* — b ||., is in an order of 10! after 20 iterations which means 
that the physical values at the interconnection points have a maximum mismatch 
of 10'. ALADIN converges quite fast and also to a very high accuracy. All 
trajectories were computed with run. ALADIN() and run. ADMM(). 
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This thesis aimed at designing distributed and decentralized optimization al- 
gorithms for problems with non-convex constraints. In this context, a fast 
convergence under limited information exchange is key. 


Chapter 2—Basics of Distributed Optimization 


To build a foundation for algorithm development and for reviewing the litera- 
ture of current distributed optimization methods, we briefly recalled distributed 
optimization algorithms such as the classical Alternating Direction Method of 
Multipliers (ADMM) and the more recent Augmented Lagrangian Alternat- 
ing Direction Inexact Newton (ALADIN). We provided illustrative examples 
showing that classical algorithms such as ADMM might exhibit slow converge 
due to the lack of constraint information in the coordination step in combination 
with alternating projections. We also showed that ALADIN is able to overcome 
these limitations due to considering constraint information in its coordination 
step. Moreover, we emphasized that for certain non-convex problems, global 
convergence is out of reach—even for centralized algorithms. 


Chapter 3—A survey on distributed optimization 


We provided a literature review on distributed optimization in Chapter 3. 
Here, we showed that distributed and decentralized optimization algorithms 
from different communities are barely able to handle constrained non-convex 
problems from power systems and control we have in mind. To this end, we 
categorized the literature on distributed optimization along three main lines 
of research: primal algorithms, primal-dual algorithms and internal decom- 
position methods. Typically, the communities working on either of the above 


131 


7 Summary and Outlook 


research lines are mostly disconnected—this literature review attempted to take 
a more general perspective. 


We showed that especially primal-dual algorithms and internal decomposition 
methods seem to be promising because of their effective constraint-handling 
capabilities. We classified ALADIN as a combination of primal-dual and inter- 
nal decomposition methods combining the best of these two lines of research: 
distribution and fast convergence for non-convex problems. We emphasized 
that— despite of its very promising convergence properties—drawbacks of AL- 
ADIN are its high communication and coordination requirements, and a lack 
of decentralization. 


Chapter 4—Bi-level Distributed ALADIN 


In Chapter 4 we developed one of the first frameworks for decentralized opti- 
mizatio with convergence guarantees for problem with non-convex constraints. 
We presented two specific decentralized algorithms for distributing the coor- 
dination step of ALADIN: a decentralized variant of ADMM and a novel 
decentralized variant of the conjugate gradient algorithm. Decentralized con- 
jugate gradient has the appealing property of a convergence in a finite number 
of steps, whereas state-of-the-art algorithms such as ADMM are often guaran- 
teed to converge at a linear rate at most and the convergence modulus can be 
slow. Decentralized conjugate gradients and decentralized ADMM introduce 
inexactness into the coordination step of ALADIN. We showed mathemati- 
cally that, despite this inexactness, the very fast local convergence properties 
of ALADIN can be preserved if the error in the coordination step decays fast 
enough. 


Possible directions for future work 


As bi-level ALADIN is guaranteed to convergence locally at present, further 
research on distributed and decentralized globalization routines seems to be 
important. Moreover, the amount of tuning one has to invest typically increases 
with the problem size. Globalization routines and internal auto-tuning routines 
are thus promising candidates to address these challenges. 
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The amount of communication and the overall convergence of bi-level AL- 
ADIN depends on the accuracy and thus on the number of inner iterations 
in the coordination step. Distributed preconditioning might help here to re- 
duce the number of inner iterations. Similarly, a dynamic adjustment of the 
termination criterion in the coordination step might help to decrease the total 
communication requirements of by using less inner iterations. 


Chapter 5—Application to Power Systems 


In Chapter 5, we evaluated the performance of bi-level ALADIN on optimal 
power flow problems up to several hundred buses. We started by reviewing 
relevant literature in this context and concluded that current approaches either 
have no convergence guarantee for AC OPF, use oversimplified models or 
relaxations, where the solution to the original problem might not be recoverable 
from the relaxed problem. Other algorithms require a large amount of central 
coordination. 


We proposed ALADIN and bi-level ALADIN as alternatives and compared 
their performance ADMM as one of the most-frequently used distributed op- 
timization methods for OPF. We showed that bi-level ALADIN is able to con- 
verge much faster than ADMM—also for a limited amount of inner iterations. 
Moreover, we highlighted that bi-level ALADIN with conjugate gradients is 
able to reach about the same low communication footprint of ADMM for 
certain problems, while converging to a much higher accuracy. 


We also showed that for bi-level ALADIN with ADMM as an inner algorithm, 
the achievable accuracy is often limited due to the limited accuracy achievable 
by ADMM in the inner loop. This effect seems not to be investigated in the 
literature so far. Moreover, we showed that tuning of inner ADMM might 
be difficult since the optimal penalization parameter changes with the outer 
ALADIN iterations. 


In the distributed OPF literature, high penalization parameters in combination 
with a feasible initial point is sometimes used for ADMM. We showed numer- 
ically and mathematically that, in its extreme, this combination makes ADMM 
getting stuck at the initial point. We also showed that pure ADMM is able to 
converge only up to a limited accuracy for the problems we consider. 
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Possible directions for future work 


The considered OPF problem forms the basis for many other problems such 
as reactive power dispatch problems, state estimation problems, power flow 
problems, and redispatch problems [Du+19; Mur+18] all of which might 
benefit from distributed optimization. Also multi-stage OPF problems, where 
OPF problems are coupled in time via storages might be of interest [FE19]. 
Broadening the view from power systems to energy systems, application in 
other domains such as building control [Su+20; Zwi+19] seems promising as 
also here a limited information exchange and decentralized computation are 
often desirable. 


Investigating the slow convergence of ADMM for certain problem classes and 
particularly in context of OPF in more detail seems to be important. Especially 
the modulus switching from Appendix C is key, since slow convergence can 
be observed even for very small problems without constraints. This is relevant 
for bi-level ALADIN, as ADMM as inner algorithm suffers from this slow 
convergence and make the achievable accuracy of bi-level ALADIN with 
ADMM limited for OPF. 


Tuning of bi-level ALADIN becomes increasingly difficult with a growing 
problem size—also for problems from power systems. Hence, testing new 
globalization routines and internal heuristics on problems from power systems 
seems worth investigating. 


Chapter 6—The ALADIN-o toolbox 


In Chapter 6, we presented one of the first general-purpose open source tool- 
boxes for decentralized non-convex optimization named ALADIN-a imple- 
menting the algorithms from the preceding chapters. The toolbox enables 
rapid-prototyping of distributed and decentralized algorithms in a modular 
framework. The MATLAB toolbox comes with a rich set of code examples 
for distributed and decentralized optimization from different engineering fields 
ranging from power systems, via chemical engineering, to mobile sensor net- 
works and robotics to machine learning, highlighting its broad applicability. 


We investigated numerical performance of ALADIN in comparison with 
ADMM on an optimal control problem for a three-vessel chemical reactor. 
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Moreover, ALADIN-a@ comes with advanced features such as parallel comput- 
ing and parametric programming enabling its usage in context of distributed 
model predictive control. 


Possible directions for future work 


It seems promising to develop a generalized variant of ALADIN-a, where 
the user can pass own solvers and differentiation routines eliminating the 
dependency on the automatic differentiation tool CasADi. This might help to 
speed up computation—especially for large problems. 


For optimal control, a real-time implementation of ALADIN-a with code 
generation seems promising. An implementation in free languages such as 
Julia or Python also seems important to enlarge the possible user base. 


Closing remarks 


With bi-level ALADIN, we presented one of the first families of algorithms 
for decentralized optimization with non-convex constraints. We showed that 
bi-level ALADIN is guaranteed to converge fast under a limited information 
exchange. We demonstrated that these properties also hold in practice for rel- 
evant problems from power systems and control. Moreover, we presented one 
of the first toolboxes for decentralized non-convex optimization implementing 
the algorithms from this thesis. 


However, these are merely first steps and future will tell which algorithms work 
robustly for a broad variety of problems. 
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A.1 Mathematical Basics 


We recall some basic mathematical results, which we used in our derivations. 


Matrix norms 


Recall that the following properties of matrix/vector norms for matrices A, B € 
R'"*"' and scalars a € R must hold 


lAl 2 0 (positive-valuedness), (A.la) 

lle AIL = [a [AI (absolute homogenity), (A.1b) 

IA + Bll < ||A|| + || B]|| (triangular inequality), (A.1c) 

lAl = O iff A 20 (definiteness). (A.1d) 

Note that we assume in all our derivations that matrix norms are induced 
by their corresponding vector norm, i.e. for a given vector norm || - || on 
R”, the induced matrix norm for a matrix A € R"*" with respect to || - || is 
|| A] := SUP eR”, || x||=1 lAx]]. For induced matrix norms, the following property 


holds additionally 

ABI] < |[ATI| Bll (sub-multiplicativity). 
Moreover we will need that given two matrices A € R'*",0 € R^*!. we have 
(A Oll = SUDI(yT. yT){J=1 IKA OQ yT) = SUP || x||=1 |Ax|| = All, 


where the last step follows from the fact that the fact that the choice of y does 
not change the supremum of || Ax ||. 
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Another important property which we will use is the "triangular inequality for 
integrals" for integrable functions F : R —> R”, 


[ F(t)dt 


which can be shown by the above properties and the definition of the Riemann 
integral. 


b 
< f FC) liar, (A.2) 


Definiteness of matrices 


A matrix M € R"*" is called positive definite if x' Mx > 0 for all x € R”. 
If the before inequality holds strictly, M is called strictly positive definite. 
Equivalently, we write M > 0 and M > 0. 


Fundamentals from calculus 


By the fundamental theorem of calculus we have for any continuously- 
differentiable function g : [a b] > R 


Í "nat = gt) - 8(n). 


t 


For an extension to the multivariate case let us consider a continuously differ- 
entiable vector-valued function f : R” — R. Define g(t) := f(a + t(b — a)) 
and thus g’(t) = V f(a +t(b — a))' (b — a). Then we have 


1 
f Vf(a+t(b-a))'(b-a)dt= f(b) - f(a). 


The extension to a vector field F : R” — R” is done by concatenation of 
single-valued functions f; : R” — Ras F(x) := (fi(x),..., fin(x))" yielding 


1 
/ VF(a+t(b-a))'(b-a)dt=F(b)-F(a), (A.3) 
0 


138 


A.1 Mathematical Basics 


for some a, b € R", where VF is the Jacobian of F and the integral is evaluated 
component-wise. 


Convergence of sequences 


In context of optimization algorithm, two question are of significant interest: 
first of all, the question “Does the sequence generated by the algorithm converge 
to a minimizer/stationary point?" and, secondly "If it converges, how fast does 
it converge to that point?". 


Let us address the first question. A sequence (x^) in the metric space 
(R", d(x, y) = ||x — y|l) is called convergent to a point x* if for any € > 0 one 
can find an N such that ||x* — x*|| « e for all n > N. Not that for a sequence 
satisfying ||x**! — x*|| < c||x* — x*|| with c € [0 1) one can always find such 
an N, since by iterating this equality we have ||x" — x*|| < c" Ix? - x*|| < e 
by choosing an N large enough. Thus, one can show convergence to x* by 
showing that the above inequality holds. 


To characterize the "speed" of convergence we distinguish between four ba- 
sic convergence rates (with increasing speed): Q-sublinear, Q-linear, Q- 
superlinear and Q-quadratic convergence. The “Q” here stands for quotient 
convergence. Consider a sequence {cx}, that converges to c. We say that this 
sequence is Q-linearly convergent to c with modulus u € (0, 1) if! 


, lex+ı — cl 
lim sup ————— = 
k—oo Ick = c| 


We say that {cz} is Q-sublinearly convergent if the above limit converges to 1 
and superlinearly convergent if it converges to 0. We say that {c} converges 
Q-quadratically to c if 


Ici — c] _ 


lim sup - «oo. 


ke Ck ep 


! Linear convergence essentially means exponential convergence in the sense that a linearly 
convergent series can be upper bounded by a exponentially covergent sequence. The name 
"linearly" convergence comes from the fact that a linearly convergent series looks linear in a 
semilogarithmic plot. 
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Figure A.1: Convergence rates in comparison. 


Note that in view of the above definition of linear convergence, if we can 
find a sequence satisfying |c**! — c| < u|c* — c|, with u € (0, 1), linear 
convergence with modulus u immediately follows. Similarly, we can conclude 
quadratic convergence from an inequality |c**! — c| < ulc* — c]?. Figure A.1 
shows examples of sublinear, linear, superlinear and quadratically convergent 
sequences. 


Note that there is an issue in the above definitions in case the sequence can 
only be upper bounded by a fast convergent sequence but does not decrease 
in each step in the above sense. This might be problematic since some op- 
timization methods do not produce a descent in the objective function value 
in each step. Consider for example the sequence c^ = e~*(1 sin(10 ea): 


This sequence somehow converges “intuitively linearly” to zero but not in 

=(k " » 

e D (Lesin(10 (k+1) £)) 
ek (1+sin(10kZ)) 


the sense of the above definition since lim sup, ,,, 


e^! (1+sin(10 (k+1) £)) 


Orsino) 7 l. 


Therefore, it makes sense to use a weaker notion of convergence rate, the 
root convergence (R-convergence), where convergence is characterized by 
a majorizing, upper bounded, and Q-convergent sequence (cy). In view 
of that we say that the sequence (cx) converges R-(linearly, superlinearly, 
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{c}, R-linearly convergent, but 
not Q-linearly convergent 


(c), R-linearly and 
Q-linearly convergent 


Figure A.2: R-convergence vs. Q-convergence. 


quadratically} to c if there exists a Q- (linearly, superlinearly, quadratically} 
sequence (c4) converging to zero such that 


lc* —e| < č. 


For the above example we can the define the Q-linearly convergent sequence 
& = 2e^* > e^*(1- sin(10 k$)) and thus R-linear convergence of {cx} 
follows. Figure A.2 illustrates the above example. Note that trivially each 
Q-linearly convergent sequence is R-linearly convergence but the converse is 
not true. 


R-convergence can be indicated using the order notation. For the above exam- 
ple we can simply write |c* — c| = O(e~*) indicating R-linear convergence 
of (cx) to c. We say that a function (sequence) f : R — R is of order 
g : R  R with respect to a € RU {-00, 00} if there exists a c > 0 such 
that | f(x)| € clg(x)| for x — a. Here we usually mean the limiting behavior 
with repsect to a — co but in optimization this notion is also sometimes used 
for describing residuals in Taylor series expansions where a is the point of 
expansion. 


Moreover, there exist several ways of characterizing the convergence of op- 
timization methods. They can be characterized in terms of a distance to a 
local minimizer (or a stationary point), i.e. ||x* — x*|| = O(-), in terms of the 
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objective function value | f(x*) — f(x*)| = O(-) or in terms of stationarity of 
the objective function (in the unconstrained case) ||V f(x*)|| = O(-). In the 
present work, we refer to the first case, i.e. convergence estimates considering 
the distance to a local minimizer if not stated differently. Note that, however, 
particularly in context of distributed optimization, convergence is often de- 
fined in terms of the objective function value. Under certain conditions such as 
Lipschitz continuity or strong convexity one can transfer estimates in terms of 
the objective function to a result in terms of the distance to a local minimizer 
[Ber99, Chap 1.2] [Bec17] but not always. Sometimes convergence results are 
stated in terms of a specific accuracy, i.e. the number of iterations needed to 
reach a specific accuracy in the objective function value | f(x*) — f(x*)| < € 
[Bec+18] which we will not use in the present work. 


A.2 Distributed problem formulations 


Here, we give a brief overview on common problem formulations used in the 
literature. Recall that in the present work we consider problems in form of 
(2.12), 


min > ka) (A.4a) 
MEER D 
subjectto _g;(x;) = 0, VER, (A.4b) 
hj(xi) € 0, vi € f, (A.4c) 
2 Aix; = b, (A.4d) 
ie 


where f;,g; and h; are smooth. A common technique to simplify problem 
(A.4) is to replace f by f; := f; +e x,» Where ı is the indicator function of the 
set X; := (x; € R"* | g;(x;) = 0, hi(x;) € 0), cf. Section 2.2.1. This yields 
an equivalent problem 


N 
i F (x; A.5 
„un. 2/4 (xi) (A.5a) 
subject to > Aixi = b. (A.5b) 
ie 
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Note that if all g; are affine and all h; are convex, all f;s are convex but possibly 
non-differentiable and discontinuous functions, which render derivative-based 
algorithms such as gradient-based methods or SQP methods non-applicable. 
However, some algorithms such as ADMM can handle such discontinuities 
under certain circumstances. By introducing auxiliary variables z; € R”*‘, we 
can write (A.5) as 


min f(x) + ic(z) 


subject to x; — Zi = 0, VIER 


with f = Der f; and where C = {z € R™ || Ereg Aizi -b = 0}. This 
problem is now in the famous two-block form 


min f(x) + g(2) (A.6) 
subject to Ax + Bz =c, (A.7) 


used in many contexts of distributed optimization [BT89, Chap 3.4], [Bec17, 
Chap 10], [PB14; EB92; Och+14; BT09; DLS16; GB16; Boy+11]. This form 
is often used in so-called operator splitting schemes having a different view on 
distributed optimization from a operator-theoretic perspective [EB92; GOY17; 
BC11]. Note that the ADMM is shown to be a special case of the so-called 
Douglas-Rachford-splitting algorithm [Gab83]. Generally, splitting schemes 
can often efficiently exploit the fact that one of the two functions f or g is 
separable which is also here case for f. Note that these splitting schemes often 
rely on convexity which can be ensured by choosing convex f; and h; and affine 


8i- 


Consensus problems 


In many cases, distributed optimization approaches consider unconstrained 
problems in form of 


min >, fio). (A.8) 


ie 
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in the literature [Shi+14; Shi+15b; NO09; MO17; DLS16; NOP10; Boy+11; 
ZK14; NOS17]. Here, the f; are typically continuous but they might not 
be differentiable, i.e. they might consider non-smooth components but do 
typically not encode constraints via the indicator function as in the previous 
subsection. Note that here all f; depend on one common decision vector x. 
Similar to the above, by introducing R copies of x, z; € R”, it is possible to 
transform (A.8) into so-called consensus form 


min 2 Fa) (A.9a) 
OU TER 
subject to z; 2x forall i1ER (A.9b) 


which is again in two-block form (A.6). Note that in this formulation, the 
network structure is not explicitly considered which makes this form interesting 
mostly in parallel computing contexts. Next, we will review problem structures 
considering the network structure. 


Connection to optimization over networks 


Many works consider distributed optimization over networks [DLS 16; Shi+14; 
Shi+15b; NO09; TT17]. Therein, problem (A.8) is typically reformulated as 
consensus problems but subject to certain communication constraints which 
are encoded as a graph G = {N,&}. One option for doing so [Shi+14] is to 
introduce global auxiliary variables z;; only for the edges (i, 7) € & yielding 


min 2. fi) (A.10) 
TER 
subject to x; = Zij, xj = Zij for all (i, j) € €. (A.11) 


Note that (A.10) is again in two-block form (A.6). An alternative option 
is copying all variables and enforcing consensus directly via characteristic 
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matrices of G such as the Laplacian matrix L € RINIxIN| or its incidence 
matrix J € R!*!*!&! [Dor18] for one-dimensional problems. This yields 


min A fix) (A.12) 
* IER 
subject to Lx=0 or I'x=0. (A.13) 


Applying ADMM or a gradient/subgradient method reveals that it suffices to 
exachange information only between neighbors, i.e. over the edges (i, j) € 6. 
Note that due to these reformulations, properties of G such as connectedness 
or algebraic connectivity influence the convergence properties of the resulting 
decentralized algorithms, where stronger connectivity typically leads to faster 
convergence [Shi+14]. 


Algorithms for (A.8) are developed without explicitly introducing the graph 
properties into the problem formulation [NO09; Shi+15b; NOS17; DLS16; 
YLY16]. In these works, typically matrices W encoding connectivity infor- 
mation are directly introduced into the iteration schemes e.g. of a gradient 
method. Recent overviews on optimization over networks can be found in 
[NOW18; NOR18], cf. [Dor18] for an introduction from a control-theoretic 
perspective. 


The here-used connectivity encoding technique 


In the present work we use network encoding which is closets related to the 
edge-based technique form the previous paragraph. Our formulation is more 
general as the before-introduced techniques in the sense that it allows for 
multiple decision variables per node, where all these decision vectors can be 
of a different dimension. This already shows that in the context of our work we 
mainly rely on *complexity in the nodes" in the sense that usually we have few 
nodes, each of which has a quite complex subproblem. Rather than defining 
the graph a-priori, we follow a different route: we define our communication 
graph based on the sparsity in the consensus matrices {A;}. In view of (A.4), 
we have a node set R to each of which we assign a decision vector x;, for all 
i € R. We define the edges of the graph via the non-zero entries in (A;): two 
subsystems i, j € R are called neighbored (and thus we assign an edge (i, j) 
to the set &) if there is a row c € C which is non-zero in bot corresponding 
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A; and A;. This way, our formulation is more general compared to the other 
formulations above since we allow for multiple variables per node and for 
multiple interconnections between two nodes. 


Moreover, other couplings exists such as objective coupling, where the coupling 
between variables takes place in the objective function. However, typically they 
can be reformulated in constrained-coupled form. We refer to [NND11] for a 
more detailled treatment in context of control. 


A.3 Consensus reformulation preserves LICQ 


In order to use distributed optimization methods, it is often necessary to refor- 
mulate problems in affinely-coupled separable form (2.12) as we have seen for 
OPF in Section B.4. One approach for doing so is copying decision variables 
which couple the subsystems nonlinearly and coupling them again by an ad- 
ditional affine equality constraints which then immideately leads to problems 
in form of problem (2.12). In this case, an important question is whether or 
not we "destroy" certain nice properties of the original problems such as con- 
straint qualifications from Definition 2. Next, we show that this is not the case, 
i.e. that copying variables and adding additional affine coupling constraints 
preserves LICQ. 


Let us assume we have an optimization problem with two blocks of variables 
x € R™,y e R” subject to equality constraints g : R"* x R — R"s with 
Ng € ny * ny. We now want to show that a introducing auxiliary variables 
z € R™ preserves LICQ. 


If LICQ holds for g, we have 
rank (Vg(x, y)) = ng 


in a small neighborhood of (x*, y*). If we minimize over an extended set of 
variables (x, y, z) € R"**?"», the constraints are given by 


gx, yz): Biel = 0. 
y- 


z 
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Moreover 


VxgGy,z) 0 VzgGny.z) 


A.14 
0 I -I ; ) 


Vē(x, y, z) = 


By LICQ, the first row of (A.14) has rank ng. Using the identity 
rank (AT BT)" = rank(A) + rank(B — BA" A) 


then yields rank (Vg(x, y, z)) = ng + ny. This shows that LICQ is preserved. 


A.4 Proof of Theorem 4 


The proof is similar to standard proofs for Newton-type methods from [Die16; 
NWO6; Ber99] and given here for the sake of a self-contained presentation and 
because of it's importance for the convergence analysis of ALADIN. The idea 
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is to estimate the distance to a local primal-dual solution to (2.12) q*. With 
the Newton-type iteration (2.7) we get 


Ia**'- a* || = la^ — q* - (M)! F(q*)II 

= lla — q* - (M) (F(q*) - F(q*)) II 

= (045! M*(g* - 4*) 
1 

- (u^! f VF(a* «t(q* — q*))(4* - q*)dtl 
= (04057! (M* - VF(q*))(q* - 4”) 
1 
- [| or? (vet erat = a) = FG) (al aan 
< ||)! (M* - vr(G | |a* - "|| 


1 
«f. [or (sr et 4°) - ve) ata - el 


IA 


1 
«la^ - a*l f olla - alldt||a* — 4*] 


w 
(«+ Sila - atl) ua - all 


where we used (in this order) F(q*) = 0, the fundamental theorem of calcu- 
lus (A.3), continuous differentiability of F and adding x (M^)-! VF(q*)(q* — 
q*) and pulling it into the integral, (in one step) the triangular inequality— 
(A.2)—submultiplicativity—and standard properties of integrals, the Lips- 
chitz condition (2.10a) and compatibility condition (2.10b) and the positiv- 
ity of norms, and finally evaluation of the integral. Convergence follows 
by inserting ||g^ — g*|| < 20-1) into the contraction estimate leading to 
lg** — g*l| < llg^ — q*|| which is sufficient for convergence to q*. =) 
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Using a sufficiently accurate grid models is extremely important. In this sec- 
tion, we will provide a brief introduction to the AC power system model provid- 
ing a sufficient accuracy for decision making on a seconds-to-hours time scale. 
This model describes the nonlinear relation between the injected/consumed 
powers at all nodes and the corresponding voltages. One often-employed alter- 
native is the so-called DC-model which is often employed in context of power 
system optimization.! Note that this model is often insufficient due to the lack 
of voltage and reactive power modeling. 


B.1 The AC model 


Let us make the assumption we use for the AC model explicit. We assume 
(i) time-scale separation; 
(ii) sufficiently short transmission lines; 
(iii) the absence of nodal shunts and transformers; 
(iv) asymmetric power system; 


(v) and a constant-power load model. 


! The DC models assumes constant voltages, no resistances in transmission lines and small 
voltage angles yielding a linear model of power systems enabling to use all benefits from convex 
optimization. However, while these assumptions represent the physical laws in transmission 
grids quite well, in distribution grids they are often violated to a large degree for example due 
to high resistances and high power injections from renewables at weak nodes leading to quite 
large voltage angle differences [WW 13]. 


149 


B Power System Fundamentals 


Assumption (i) implies that we have a steady-state and transient phenomena 
can be neglected; assumption (1) together with assumption (ii) allows for using 
a z-equivalent model of transmission-lines/cables. Assumption (iii) is mainly 
made for simplified presentation, including shunts and transformers poses no 
significant difficulty from an algorithmic perspective—however it somewhat 
complicates model derivation. Assumption (iv) implies that we can represent 
all three phase by their single-phase equivalent. Assumption (v) is made to 
be able to avoid the difficulty of handling different load models which is quite 
common under for the here-considered steady-state problems [Ari+18]. 


Remark 20 (Discussion of the above assumptions) Finer-grained models exists 
allowing to violate the above assumptions. Examples include using hyperbolic 
equations for the transmission lines (or even partial differential equations) 
[GS94, Chap 6], using three-phase model for unbalanced grids, using voltage- 
dependent demands [Ari+18], considering voltage-regulated buses and con- 
sidering transformers and line shunts. However, there is always a trade-off 
between sufficiently accurate modeling and complexity of the resulting prob- 
lem. One usually aims at using a model which captures the relevant effects in 
to sufficient accuracy, but neglects further effects in order to keep the resulting 
optimization problems tractable. With the set of assumptions given here, we 
are still able to handle voltage bounds and reactive power. 


The AC power flow equations 


We model an electrical grid by an undirected graph G = (N,6&,Y), where 
N = {l,..., N} is the set of buses, E € N x N is the set of branches and 
4:6 Œ, (k, D => (Ykı, Y%ı) is a map assigning a line admittance and a 
shunt admittance to each branch. We use these the admittances in a z-branch 
model shown in Figure B.1. In order to derive the relationships between 
bus powers and all bus voltages, we investigate the relationship between the 
complex powers flowing into/out of the branch sxı, Sıx € C and the voltage 
phasors at the beginning/end of the branch u, € C and u; € C by means of 
circuit theory. 


The complex power flowing from node k to node / is sy; = uyi;;, where 
iy; € C is the complex current over the branch (k,l) and z* denotes the 
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Figure B.1: z-line model of a branch. 


cM QU at of ze C. Furthermore we have ij; = (uy — uj) yii; and 


İki = ikl = up L . Thus, 


* * y 
Ski = Uk [e -urua i), (B.1) 


By the law of energy conservation, the complex (net) power at node k, sx has 
to be the sum of all powers flowing into connected transmission lines yielding 


N s* 
T3 [o - yu De forall keN. (B2) 
k=1 


By defining a bus admittance matrix 


Ykm : 
[X k Pe if k=l, 
[Yu] Ykm 


—Ykl. if kzl, 
we can write (B.2) compactly as 
s = diag(v) Y*v*. (B.3) 


Here, v = (Udkaı.... 
S = (Sk)k=i,..,n € CN denotes the vector of all complex power injections and 
we define diag(z) € C”-*"% as a diagonal matrix with the entries of z € C"« 
on the main diagonal. If multiple components such as loads, generators and 
storages are connected to one node we have 


sy =, — 8, - s} forall keN, (B.4) 
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Figure B.2: Apparent power balance at node k. 


where E € C denotes apparent power of a generator, si € C denotes the 


apparent power demand and s; € C denotes the apparent power of a storage 
all connected to bus k. If there is no generator/demand/storage at a node, we 
set the corresponding power to zero. 


As numerical algorithms are typically designed for real-valued spaces, we have 
to transform (B.3) to the real domain. Here, we have different possibilities: we 
can represent the complex voltages u; at all nodes k = 1,...,N in cartesian 
coordinates uz, = VE +j ym or in polar coordinates ug = vie? % where vz is 
the voltage magnitude and 6; is the voltage angle at node k. The same choice 
we have for the complex line admittances yx;. The choice of these coordinates 
leads to different forms of the power flow equations—for details we refer to the 
excellent survey [FR16]. Here we use the combination of polar coordinates for 
v and s, and rectangular coordinates for Y = G + j B, which is the most popular 
formulation for OPF. By doing so and by taking real and imaginary parts of 
(B.3) we get 


N 

Re(sk) = pk = Vk 2. vi(G yi cos(x) + Bri sin(0j1)), (B.5a) 
k=l 
N 

Im(sx) = qk = vk ) vi(Gr sin(0ki) — Byj cos(Ox1)), (B.5b) 
kel 


for all nodes for all nodes k € N, where 0x1 := 0, — 0; and where px is called 
the active power and qx is called the reactive power at node k. Note that p; and 
qx are net powers (i.e. the residual between generation and demand at node 
k); substituting p* and q* in (B.5) with the corresponding real/imaginary part 
of equation (B.4) yields power flow equations in terms of component powers. 
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B.2 Power flow analysis 


Recall that classically the only quantities which we can directly influence in 
power systems are active and reactive power injections from generators pt and 
qi. But how to compute the voltage angles 6; and voltage magnitudes v; for 
given power injection and demand? This question is subject of so-called power 
flow analysis. 


As the power flow equations (B.5) are usually not algebraically solvable for 
(v, 0), we have to use iterative methods. Let us write the power flow equa- 
tions (B.5) with (B.4) as 


pi- PS = DS. — vkXgavi(Gi cos(614) + Bri Sin(Ox1))\ _ 


Fk(Xk):= à - 
qt -qi-qi- vk Xia vi(Gusin(y) — By cos(814)) 


for all nodes except the first node k = 2,...,N with yk := (vr, Ox,)'. Let 
us assume that the variables Pio djs Pis Pio ds and q; are fixed and given 
here. This yields an implicit function F(x) := (FxCyx)) ga 
79 NE 


n> Where x := 


— 


Bus one is not included in F and has a special role here: technically speaking, 
this bus has to compensate the power demand (active and reactive power) 
which is not covered by the other generators. Physically this follows from 
the law of energy conservation. Mathematically, this follows from the fact 
that the last power flow equation can be expressed as a linear combination of 
the other power flow equations (if no shunts are present as we assume here) 
implying that one of the power flow equations is "redundant". This leads to 
the situation that we can not apply the implicit function theorem for locally 
inverting F (because of the singularity of VF). Hence, also iterative methods 
such as Newton-type methods would fail. The common approach for tackling 
this issue is to introduce a so-called slack-bus at a bus with a large generator 
(typically bus one), where instead of fixing the net powers p; and q1, we fix 
the voltage angle 6; and voltage magnitude vı. Hence, the active and reactive 
powers become “free variables" at this node in the sense that they are a result 
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rather than a precondition of our computation? Thus, the extended implicit 
function reads 


ref ! 
F(x) = (v-v, 9-08, (Peen) 20, B6 


with x := (pı, qi, (Vk, Ok)k)k=2,...,ÊN) where yet and grt are constant and 
given reference values. Note that the components of F become linearly in- 
dependent and VF(x) becomes invertible. The above equations can now be 
solved by standard Newton-type methods from Section 2.1.2, which is then 


called a power-flow study. 


Remark 21 (Different bus types) Note that we assume here for simplicity that 
we only have two bus types: "pq-buses", where active and reactive power is 
given and a slack bus, where the voltage angle and magnitude are given. In 
many works, also a third bus type is considered (*pv-buses"), where active 
power and the voltage magnitude are given and the voltage angle and the 
reactive power are unknown. This comes from the fact that some generators 
are equipped with voltage regulators, which keep the voltage magnitude at a 
certain value by injecting reactive power. We leave this out for simplicity here 
as our goal is to outline main idea of power flow computations and we refer to 
[GS94; WW 13] for further details. 


B.3 Optimal power flow 


The classical goal of optimal power flow is to compute (in a certain sense) 
optimal set-points for all controllable devices in a grid such that a certain cost 
function is minimized. Thereby, technical and physical limitations such as 
power generation limits of the power flow equations (B.5) are considered. 


? This assumption is technically sound if a large generator is connected to the first bus which can 
typically be ensured by an appropriate numbering of the buses. 
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In many cases, the objective function in OPF is the sum of the cost function 
for active power generation of all generators [Zhu15; FR16] 


N 
fG) : 5 AOD. 
k=1 


where we have x := (y«)xk=1,....N With yy := (Vk, Ok, P qu and i. q$, 
q}, fixed and given. In contrast to the power flow problem from the previous 
subsection, all generator powers are "free" variables now, i.e. we would like 
to compute them optimally. The individual generator cost functions are often 
chosen as quadratic functions 


JEDE) :- ar (EY + bx p$ + cx 


for all k = 1,..., N with fË = 0 for non-generator buses. Note that the genera- 
tor cost generally does not depend on the reactive power injection qi. Reactive 
power can be seen as a power to charge and discharge transmission lines and 
inductive/capacitive loads—thus this energy oscillates between consumer and 
is thus not related to any generation cost (except for slightly changing grid 
losses). 


Power generators are usually subject to active and reactive power limitations. 
These limitations come from operation limits of the power generation units and 
for the reactive power mainly from current limitations in the generator stator 
winding. Moreover, voltage magnitudes have to stay withing certain bounds 
to ensure proper functioning of the connected devices and to avoid damaging 
the electrical insulation. The current in transmission lines is also constrained 
by thermal limits. This is often expressed (although not entirely correct)? by 
limits on the magnitude of the apparent power over lines 


Isxi(x)| := A Pia? + qa Q2, (B.7) 


? To be more precise, one would have to limit the magnitude of the current over the transmission 
line |ix;|. As voltage magnitudes typically deviate only slightly from the nominal voltage, the 
approximation by an apparent power limit is sufficiently tight in most cases. 


155 


B Power System Fundamentals 


where pj; = Re(sxz) and qx; = Im(sxı) with sz; from (B.1). Stacking the 
apparent powers for all transmission lines yields 


ISPO = (Is?) .- 


Combining the above yields a (single-stage) optimal power flow problem 


min f(x) (B.8a) 

subject to F(x)=0, (B.8b) 
ISP) < 5° (B.8c) 

X<x<X, (B.8d) 


where x and x collect lower/upper bounds on power injections and voltage 
magnitudes; s? collects the upper limits on the apparent power over branches. 


Remark 22 (OPF variants) Note that there exist many variants of OPF, differ- 
ing for example in the underlying grid model (DC model vs. AC model, polar 
coordinates vs. rectangular coordinates, ...) [Zhu15; FR16; MH19], consid- 
ering uncertain power injection and/or demands [FGM18; M#17; Müh+19; 
BCH14] or extensions considering component failure, which is called security 
constrained OPF [MPG87; Cap+11]. Trying to give an exhaustive overview 
is beyond the scope of the present thesis and we refer for example to [FR16; 
FSR12; Zhu15; Cap+11; Cap16] for details. However, the majority of these 
problem formulations are extensions of the "standard OPF problem" presented 
here and we see the here presented formulation as a basis for further studies. 


B.4 Distributed reformulation 


In order to apply distributed optimization algorithms, the above problems have 
to be in affinely-coupled separable form (2.12). In this section we describe one 
way of doing so. 


In (2.12), any nonlinear equality constraint such as the AC power flow equa- 
tions (B.5) have to be local constraints collected either in g; or h;, such that the 
coupling between the subsystems is solely affine and in form of (2.12d). There 
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IN 


C9 generator 
wy consumer 


Figure B.3: Example grid with partitioning. 


are multiple ways for achieving affine coupling: here we introduce auxiliary 
nodes in the middle of transmission lines interconnecting two subsystems and 
coupling active/reactive power and the voltage phasor at these nodes. Other 
works such as [Ers15] use coupling in the voltage angles and magnitudes only. 
Another option is, to introduce variable copies of the bus states connected to 
border lines and to couple the original states again with their copies by means 
of (2.12d) without introducing any additional buses [HA09]. Our formulation 
has the advantage, that no “internal information" from the interior of the sub- 
systems has to be exchanged and that two neighbored subsystems share the cost 
for the line losses of the coupling line. However, other methods may benefit 
from a reformulated problem with a smaller dimension due to the absence of 
auxiliary buses. 


Let us assume that a grid partitioning is given (for example by the control 
the area of responsibility of the corresponding DSOs/TSOs or by borders 
between different grid levels). This means that we have a set of subsystems 
R = {1,..., R}, each equipped with a corresponding subset of buses N; C N 
where these subsets are disjoint N; NN; = 0 for all i, j € R with i + j. Let us 
consider a concrete example: Figure B.3 shows a partitioning for the example 
5-bus system from [LB10]. We decompose the node set N = {1,...,5} into 
three regions R = {1,2,3} with M = {1,5}, No = {2,3} and M = {4} 
fulfilling the above assumptions. 
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Splitting lines between subsystems 


Consider the z-branch model from Figure B.1 shown in Figure B.4, where 
we assume that the line between node k and node / is a transmission line 
connecting two subsystems. We introduce two auxiliary buses m and n which 
are connected to the corresponding buses k and / in the subsystems—thus, 
we replace the original z-branch model by two z-branch models in series.* 
Regarding the line parameters, we double the series admittance of the original 
connection yy; and divide the shunt admittance y}, by two. With that we get 
an auxiliary buses m and n, which we collect in an auxiliary node set N^ and 
an auxiliary node pair (m,n) which we collect in a set A € N^ x N^ and 
new transmission lines £^ c N x N^. To obtain equivalence to the original 
problem, we require 


Um — us and Skm=-5n,1 foral (m,n) E€ A. (B.9) 
Taking real part and imaginary part of (B.9) yields 


Vm — Vn Pk,m = —Pn,l (B.10a) 
Om = On, Qk,m = — nl: (B.10b) 


for all (m, n) € A resulting in an affine coupling in form of (2.12d). 


In order to simplify presentation, we will assume from now on that the auxiliary 
buses are already included in N and N; respectively. More precisely, this means 
that we introduce extended bus sets NE = N UN? and NE -N;UÍme 
N^ | there exists (k,m) € 8%}. In the following we use the symbol N for 
the extended bus set NË for simplicity. 


^ Note that this reformulation is equivalent to the original model only if Yk, , = 0. However, froma 
physical perspective, a series connection of multiple z-elements can in general more accurately 
describe the physical behavior of the branch as it allows for a more accurate approximation of 
the hyperbolic equations of transmission lines [Sch17, Ch. 10.3]. 
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kl Uy €? Im 


. e 
Skm || Snl 


Figure B.4: Decomposition of a z-line model. 


The reformulated problem 


With the above, we can construct the power flow equations for each region 
i € f as 
Fi(xi) = (Fk (Xk) ken 


where x; := (Yx)ke N;. Similarly, we have for the line limits and box constraints 


ISRO == (Ind) and xo sisi 


for all i € R. Furthermore, we define for alli € R 


fio): >) fico. 


ke Ni 
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This yields an OPF problem in form of (2.12) 


min 2800 (B.11a) 

Xl,- XR 4 

iER 

subject to F;(x;) 20 forall VER, (B.11b) 
ISE) « $2 forall ie, (B.11c) 
X; € X; € Xi forall VER, (B.11d) 
YA x; 20, (B.11e) 

ie 


where the matrices A; € RIIAT 4^ are constructed such that the coupling 
constraints (B.10) are satisfied. 
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C.1 Modulus switching in ADMM 


As we have seen in Section 5.5, d-ADMM seems to be able to reach only a 
limited accuracy. In this section we will investigate this behavior on a small- 
scale example. It turns out that the accuracy is not limited, but the convergence 
is so slow that is seems like ADMM is not making any progress. Moreover, 
we show that by choosing a different p, one can achieve a higher accuracy but 
then the overall convergence to that accuracy is slow. So there seems to be a 
trade-off between the achievable accuracy and the convergence speed. 


Consider an unconstrained strongly convex QP with objective function f(x) = 
$x" Hx — h' x, where H = Ser Hi > Oand h = Freg hi with ADMM similar 
to the inner problems in bi-level ALADIN from Section 4.2.1. Let 


a (C. € a ctd c 0 d O0 
Hı=|c d 0|, Fh=|c+d b d|, H=|d b dl], 
c 0 0 C d 0 0 d O0 


and hı = (a 0 O0)',h;-(a b 0)',h3=(0 b 0)" witha = 10%, 
b =5,c = 107 and d = 1077. Then we have a condition number cond(H) = 
6.25 - 10^ which should not pose too severe difficulties. 


Figure C.1 shows the resulting convergence plots for d-ADMM for different 
values of p. One can clearly see that the accuracy in the primal and dual 
variables seems limited for this problem to round about 107? for p = 107^. 
By choosing a larger p of 5 - 107? or 1072, one can achieve higher accuracies, 
but in this case one needs several thousand iterations to reach an accuracy 
of 107? in the primal variables. In practice, this is often not an option and 
especially not in bi-level ALADIN as then the overall communication would 
be very high. So in practice, tuning of p is typically done in a way such that one 
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Figure C.1: ADMM for solving a unconstrained convex QP. 


gets fast convergence in the primal variables in the beginning of the ADMM 
iterations. Hence, for the here-considered example one would probably choose 
a p around 107? since on can then reach an accuracy of around 1074 after 
several tens/hundreds of iterations. By doing so one also essentially limits 
the accuracy of ADMM to that level. Note that for small p = 1075, ADMM 
diverges as the individual f;s are unbounded below since the matrices A; are 
indefinite. 


From our example one can observe another interesting effect: ADMM seems 
to switch it's linear convergence modulus while iterating. Especially in case 
of p = 107? illustrates one can observe this behavior. A similar behavior is 
also observed in other works [NOB16; Gol+14]. To the best of our knowledge 
this effect is not investigated in the literature so far. One can also draw 
another conclusion from this example: although for strongly convex problems 
(which our example is), linear convergence has been shown (e.g. [GB16]), 
the present example (and the examples from the inner problems in bi-level 
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ALADIN (Section 5.5)) shows that the modulus of convergence for ADMM 
can be arbitrarily bad. Thus, one has to be very careful with the interpretation 
of such results—if there considered situation occurs one essentially needs a 
very large number of iterations for convergence. 


Conditions for which slow convergence occurs seems to the best of our knowl- 
edge not to exist in the literature. In our particular case the matrices H; are 
indefinite leading to non-convex f;s, which might be one reason. However, 
investigating this in more detail seems to be an important direction of future 
work. 


C.2 Examples: ADMM getting stuck for p — co 


We give an analytic example where one can observe that ADMM gets stuck for 
p — œ and a feasible initial point as predicted by Proposition 1. Moreover, 
we show that ADMM converges to a feasible but not optimal point in case 
of an infeasible initialization for the same example. In a second example, 
we show that evaluating primal feasibility only is in general not sufficient for 
convergence of ADMM, which is sometimes assumed in the power system 
context [Ers15; GHT17]. 


Example 4 (ADMM getting stuck for p > co) Consider the problem min, ne 


subject to x = z and x > —1. The partial augmented Lagrangian with respect 
to the constraint x = z is £5(x, z, A) = 5x? + A(x -—z) + E(x - 2)”. 

Case 1: Consider a feasible z°, i.e. z? > —1. Applying ADMM (Algorithm 3), 
minimizing Lp with respect to x yields x**! + A* + p(x**! — z*) = 0 and 


k_ ak k_ kj, pro , OM ; : 

thus x**! = ae = ms — z*. Minimizing with respect to z yields 
k k_ qk k poo 

Ak p(x**1 - k+l) = Oand thus z+! = xt! = 2 74/0 ae ^ y" ok The 


p 1/p+1 P 
Aupdate in step 3) reads A**! = A e p(x**1 - zl) = AK + p(xkt!—x ht! _ 4-) - 
0. This show that, as predicted by Proposition 1 ADMM gets stuck for this 
example at z*. In case when the initial point is infeasible however, the situation 
is different as we will see now. 
Case 2: Consider an infeasible point z^ < —1 and an arbitrary but fixed A*. 
Then, the minimizer of Lp with respect to x, x**! will be -1 for p — © since 
x > —] has to be satisfied. After that iteration, we have similar to the above 


163 


C Insights on ADMM 


z* = y! = _1 for p — œ and thus A**! — 0. This show exemplarily, 


that the behavior of ADMM for large p heavily depends on whether or not the 
initial point is feasible. Note that in case that in case there are also individual 
constraints for z, the situation might be even more complicated as in this case 
some kind of "alternating projections" between the constraint sets might occur 
leading to an oscillatory behavior. Moreover, note that a similar situation 
occurs if one increases p to very large values as ADMM proceeds as for 
example done in [Ers15; GHT17]. 


Example 5 (Primal feasibility is not sufficient for termination) Consider ex- 
ample 4 again neglecting the constraint x 2 —1. Let us furthermore assume 
that we use ADMM with primal feasibility (|x — z| « €) as the only termina- 
tion criterion. From Example 4 we know that |x**! — z^*!| = |a*/p|. Thus 
shows that if we choose a p large enough such that |A*/p| < e, ADMM will 
imediately terminate although even though (x**!, z**!) is not optimal. 
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Computer Science 


Distributed optimization and computer science are strongly connected. One 
reason for this is that distributed optimization has many promising applications 
in computer science beyond a multi-agent setting. One can identify applications 
in at least three branches: 


* in supercomputing [CZ97; BT89; Don+03]; 
* in network optimization [BT89; San+19; Chi+07]; 


* in machine learning, specifically 


— in image reconstruction [Yan+16; ZK14; XY13]; 


— and in support vector machines [FCG10; Boy+11]. 


In this thesis we present a particular SVM example in Chapter 2. An additional 
logistic regression example from machine learning is included in the example 
set of ALADIN-a available under [Eng20a]. 


Also from an historical perspective, many pioneers of distributed optimization 
are closely connected to computer science. John von Neumann's alternating 
projection method [Neu49] for example is closely related to ADMM; or John 
Tsitsiklis with his work on distributed and asynchronous algorithms [Tsi94; 
BT89]. These roots can also be observed in terminology such as nonlinear/in- 
teger/quadratic programming. 


Moreover, interconnections from systems and control and optimization over 
networks/theoretical informatics exist. The Bellmann-Ford algorithm for ex- 
ample was invented by one of the most famous people working in systems 
and control and pioneers of so-called dynamic programming: Richard Bell- 
mann [Bel58]. Even Dijkstra’s algorithm can be cast within the framework 
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of dynamic programming which is often considered to be a “control method" 
[Sni06]. Hence, distributed optimization can be seen as an interdisciplinary 
field connecting computer science, applied mathematics and systems and con- 
trol. 


Distributed optimization vs. parallel computing 


In computer science, a distinction between parallel computing and distributed 
computing is often made. Parallel computing usually means that computation is 
done on multiple processors with one shared memory. Distributed computing 
typically refers to the case, where computers are connected via a network 
with a distributed memory and operate via message-passing [BT89; Don+03]. 
This work focuses mainly on distributed computing altough the developed 
algorithms can also be used for parallel computing. 


Although distributed optimization techniques are interesting for parallel com- 
putation, the requirements are typically slightly different. Whereas in parallel 
computing partitioning is usually done for load balancing—i.e. the problem 
size handled by each processor is approximately the same for all processors—in 
our setting the partitioning is usually given a priori by the subsystem bound- 
aries. Furthermore, in parallel computing, communication is usually not of 
such a big concern because of the fast and high-throughput shared memory. 


Distributed optimization vs. optimization over networks 


The problems considered in this thesis are different from classical optimization 
problems in theoretical informatics defined over graphs such as shortest-path 
problems:! here we consider continuous optimization problems without inte- 
gers and with graphs only as a side-topic. 


In contrast to network optimization, in our setting complexity is mostly “in the 
nodes" rather than “in the network" in the sense that in network flow the nodes 
have typically only one decision variable and the edges are more important. In 


! Although certain interconnections may exist. The duality theory from Chapter 2 can for example 
be fruitfully used in context of network flow problems [Roc98; Ber98]. 


166 


D Distributed Optimization and Computer Science 


this thesis, we consider problems, where each node is a (complex) optimization 
problem on its own, for example itself being a dynamical system or an area 
of a power grid. Methods from network optimization are not straight-forward 
to use in this setting. An additional difference to classical optimization over 
networks is that here the edges in general come with underlying physical laws. 
Although electrical grids can be seen as a graph with complex-valued weights 
[Lei+15], for the AC grid model the nodes are coupled via nonlinear equations 
making classical algorithms from theoretical informatics hard to use. 
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In many engineering domains, systems are composed of partially independent subsystems — power 
Systems are composed of distribution and transmission systems, teams of robots are composed of 
individual robots, and chemical process systems are composed of vessels, heat exchangers and reactors. 
Often, these subsystems should reach a common goal such as satisfying a power demand with minimum 
cost, flying in a formation, or reaching an optimal set-point. 


[72] 


1 


echniques are among the most successful tools for controlling such systems 
optimally with feasibility guarantees. Yet, they are often centralized — all data has to be collected in one 
central and computationally powerful entity. Methods from distributed optimization overcome this 
limitation. Classical approaches, however, are often not applicable due to non-convexities. The present 
work develops one of the first frameworks for distributed optimization with non-convex constraints. 
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